Abstract


We study contextual bandits with budget and time constraints, referred to as constrained
contextual bandits. The time and budget constraints significantly complicate the exploration
and exploitation tradeoff because they introduce complex coupling among contexts over time.
To gain insight, we first study unit-cost systems with known context distribution. When the
expected rewards are known, we develop an approximation of the oracle, referred to as
Adaptive Linear Programming (ALP), which achieves near-optimality and only requires the
ordering of expected rewards. With these highly desirable features, we then combine ALP with
the upper-confidence-bound (UCB) method in the general case where the expected rewards are
unknown a priori. We show that the proposed UCB-ALP algorithm achieves logarithmic regret
except for certain boundary cases. Further, we design algorithms and obtain similar regret
bounds for more general systems with unknown context distribution and heterogeneous costs.
To the best of our knowledge, this is the first work that shows how to achieve logarithmic
regret in constrained contextual bandits. Moreover, this work also sheds light on the study of
computationally efficient algorithms for general constrained contextual bandits.


1 Introduction 

The contextual bandit problem is an important extension of the classic multi-armed bandit
(MAB) problem, where the agent can observe a set of features, referred to as context, before
making a decision. After the random arrival of a context, the agent chooses an action and receives
a random reward with expectation depending on both the context and action. To maximize the
total reward, the agent needs to make a careful tradeoff between taking the best action based on the
historical performance (exploitation) and discovering the potentially better alternative actions under
a given context (exploration). This model has attracted much attention as it fits the personalized
service requirement in many applications such as clinical trials, online recommendation, and online
hiring in crowdsourcing. Existing works try to reduce the regret of contextual bandits by leveraging
the structure of the context-reward models such as linearity or similarity, and more recent
work focuses on computationally efficient algorithms with minimum regret. For Markovian
context arrivals, algorithms such as UCRL for more general reinforcement learning problems can
be used to achieve logarithmic regret.

However, traditional contextual bandit models do not capture an important characteristic of real
systems: in addition to time, there is usually a cost associated with the resource consumed by each
action, and the total cost is limited by a budget in many applications. Taking crowdsourcing as
an example, the budget constraint for a given set of tasks will limit the number of workers that an
employer can hire. Another example is clinical trials, where each treatment is usually costly
and the budget of a trial is limited. Although budget constraints have been studied in non-contextual
bandits where logarithmic or sublinear regret is achieved, as we will see
later, these results are inapplicable in the case with observable contexts.

In this paper, we study contextual bandit problems with budget and time constraints, referred to
as constrained contextual bandits, where the agent is given a budget $B$ and a time-horizon $T$. In
addition to a reward, a cost is incurred whenever an action is taken under a context. The bandit
process ends when the agent runs out of either budget or time. The objective of the agent is to
maximize the expected total reward subject to the budget and time constraints. We are interested in
the regime where $B$ and $T$ grow towards infinity proportionally.

The above constrained contextual bandit problem can be viewed as a special case of Resourceful
Contextual Bandits (RCB) [17]. In [17], RCB is studied under more general settings with possibly
infinite contexts, random costs, and multiple budget constraints. A Mixture_Elimination algorithm is
proposed and shown to achieve $O(\sqrt{T})$ regret. However, the benchmark for the definition of regret
in [17] is restricted to within a finite policy set. Moreover, the Mixture_Elimination algorithm suffers
from high complexity, and the design of computationally efficient algorithms for such general settings is
still an open problem.

To tackle this problem, motivated by certain applications, we restrict the set of parameters in our
model as follows: we assume finite discrete contexts, fixed costs, and a single budget constraint. This
simplified model is justified in many scenarios such as clinical trials and rate selection in wireless
networks. More importantly, these simplifications allow us to design easily-implementable
algorithms that achieve $O(\log T)$ regret (except for a set of parameters of zero Lebesgue measure,
which we refer to as boundary cases), where the regret is defined more naturally as the performance
gap between the proposed algorithm and the oracle, i.e., the optimal algorithm with known statistics.

Even with the simplified assumptions considered in this paper, the exploration-exploitation tradeoff is
still challenging due to the budget and time constraints. The key challenge comes from the complexity
of the oracle algorithm. With budget and time constraints, the oracle algorithm cannot simply
take the action that maximizes the instantaneous reward. In contrast, it needs to balance between
the instantaneous and long-term rewards based on the current context and the remaining budget. In
principle, dynamic programming (DP) can be used to obtain this balance. However, using DP in
our scenario incurs difficulties in both algorithm design and analysis: first, the implementation of
DP is computationally complex due to the curse of dimensionality; second, it is difficult to obtain
a benchmark for regret analysis, since the DP algorithm is implemented in a recursive manner and
its expected total reward is hard to express in closed form; third, it is difficult to extend the
DP algorithm to the case with unknown statistics, due to the difficulty of evaluating the impact of
estimation errors on the performance of DP-type algorithms.

To address these difficulties, we first study approximations of the oracle algorithm when the system
statistics are known. Our key idea is to approximate the oracle algorithm with linear programming
(LP) that relaxes the hard budget constraint to an average budget constraint. When fixing the average
budget constraint at $B/T$, this LP approximation provides an upper bound on the expected total
reward, which serves as a good benchmark in regret analysis. Further, we propose an Adaptive
Linear Programming (ALP) algorithm that adjusts the budget constraint to the average remaining
budget $b_\tau/\tau$, where $\tau$ is the remaining time and $b_\tau$ is the remaining budget. Note that although the
idea of approximating a DP problem with an LP problem has been widely studied in the literature,
the design and analysis of ALP here is quite different. In particular, we show that ALP
achieves $O(1)$ regret, i.e., its expected total reward is within a constant independent of $T$ from the
optimum, except for certain boundaries. This ALP approximation and its regret analysis make an
important step towards achieving logarithmic regret for constrained contextual bandits.

Using the insights from the case with known statistics, we study algorithms for constrained contextual
bandits with unknown expected rewards. Complicated interactions between information acquisition
and decision making arise in this case. Fortunately, the ALP algorithm has a highly desirable
property: it only requires the ordering of the expected rewards and can tolerate certain estimation
errors of system parameters. This property allows us to combine ALP with estimation methods that
can efficiently provide a correct rank of the expected rewards. In this paper, we propose a UCB-ALP
algorithm by combining ALP with the upper-confidence-bound (UCB) method. We show that
UCB-ALP achieves $O(\log T)$ regret except for certain boundary cases, where its regret is $O(\sqrt{T})$.
We note that UCB-type algorithms are proposed in [20] for non-contextual bandits with concave
rewards and convex constraints, and further extended to linear contextual bandits. However, [20]
focuses on static contexts¹ and achieves $O(\sqrt{T})$ regret in our setting since it uses a fixed budget
constraint in each round. In comparison, we consider random context arrivals and use an adaptive
budget constraint to achieve logarithmic regret. To the best of our knowledge, this is the first work
that shows how to achieve logarithmic regret in constrained contextual bandits. Moreover, the proposed
UCB-ALP algorithm is quite computationally efficient and we believe these results shed light
on addressing the open problem of general constrained contextual bandits.

¹ After the online publication of our preliminary version, two recent papers [21, 22] extend their previous
work [20] to the dynamic context case, where they focus on possibly infinite contexts and achieve $O(\sqrt{T})$
regret, and [21] restricts to a finite policy set as [17].

Although the intuition behind ALP and UCB-ALP is natural, the rigorous analysis of their regret is
non-trivial since we need to consider many interacting factors such as action/context ranking errors,
remaining budget fluctuation, and randomness of context arrival. We evaluate the impact of these
factors using a series of novel techniques, e.g., the method of showing concentration properties under
adaptive algorithms and the method of bounding estimation errors under random contexts. For
ease of exposition, we study the ALP and UCB-ALP algorithms in unit-cost systems with known
context distribution in Sections 3 and 4, respectively. Then we discuss the generalization to systems
with unknown context distribution in Section 5 and with heterogeneous costs in Section 6, which
are much more challenging; the details can be found in the supplementary material.

2 System Model 

We consider a contextual bandit problem with a context set $\mathcal{X} = \{1, 2, \ldots, J\}$ and an action set
$\mathcal{A} = \{1, 2, \ldots, K\}$. At each round $t$, a context $X_t$ arrives independently with identical distribution
$P\{X_t = j\} = \pi_j$, $j \in \mathcal{X}$, and each action $k \in \mathcal{A}$ generates a non-negative reward $Y_{k,t}$. Under a
given context $X_t = j$, the rewards $Y_{k,t}$ are independent random variables in $[0,1]$. The conditional
expectation $E[Y_{k,t} \mid X_t = j] = u_{j,k}$ is unknown to the agent. Moreover, a cost is incurred if action $k$
is taken under context $j$. To gain insight into constrained contextual bandits, we consider fixed and
known costs in this paper, where the cost is $c_{j,k} > 0$ when action $k$ is taken under context $j$. Similar
to traditional contextual bandits, the context $X_t$ is observable at the beginning of round $t$, while only
the reward of the action taken by the agent is revealed at the end of round $t$.

At the beginning of round $t$, the agent observes the context $X_t$ and takes an action $A_t$ from $\{0\} \cup \mathcal{A}$,
where “0” represents a dummy action in which the agent skips the current context. Let $Y_t$ and $Z_t$ be the
reward and cost for the agent in round $t$, respectively. If the agent takes an action $A_t = k > 0$,
then the reward is $Y_t = Y_{k,t}$ and the cost is $Z_t = c_{X_t,k}$. Otherwise, if the agent takes the dummy
action $A_t = 0$, neither reward nor cost is incurred, i.e., $Y_t = 0$ and $Z_t = 0$. In this paper, we focus
on contextual bandits with a known time-horizon $T$ and limited budget $B$. The bandit process ends
when the agent runs out of the budget or at the end of time $T$.

A contextual bandit algorithm $\Gamma$ is a function that maps the historical observations
$\mathcal{H}_{t-1} = (X_1, A_1, Y_1; X_2, A_2, Y_2; \ldots; X_{t-1}, A_{t-1}, Y_{t-1})$ and the current context $X_t$ to an action
$A_t \in \{0\} \cup \mathcal{A}$. The objective of the algorithm is to maximize the expected total reward $U_\Gamma(T,B)$ for
a given time-horizon $T$ and a budget $B$, i.e.,

$$
\begin{aligned}
\text{maximize}_{\Gamma} \quad & U_\Gamma(T,B) = E_\Gamma\Big[\sum_{t=1}^{T} Y_t\Big] \\
\text{subject to} \quad & \sum_{t=1}^{T} Z_t \le B,
\end{aligned}
$$

where the expectation is taken over the distributions of contexts and rewards. Note that we consider
a “hard” budget constraint, i.e., the total cost should not be greater than $B$ under any realization.

We measure the performance of the algorithm $\Gamma$ by comparing it with the oracle, which is the optimal
algorithm with known statistics, including the knowledge of the $\pi_j$'s, $u_{j,k}$'s, and $c_{j,k}$'s. Let $U^*(T,B)$
be the expected total reward obtained by the oracle algorithm. Then, the regret of the algorithm $\Gamma$ is
defined as

$$R_\Gamma(T,B) = U^*(T,B) - U_\Gamma(T,B).$$

The objective of the algorithm is then to minimize the regret. We are interested in the asymptotic
regime where the time-horizon $T$ and the budget $B$ grow to infinity proportionally, i.e., with a fixed
ratio $\rho = B/T$.

3 Approximations of the Oracle 

In this section, we study approximations of the oracle, where the statistics of bandits are known 
to the agent. This will provide a benchmark for the regret analysis and insights into the design of 
constrained contextual bandit algorithms. 

As a starting point, we focus on unit-cost systems, i.e., $c_{j,k} = 1$ for each $j$ and $k$, from Section 3 to
Section 5, which will be relaxed in Section 6. In unit-cost systems, the quality of action $k$ under context
$j$ is fully captured by its expected reward $u_{j,k}$. Let $u_j^*$ be the highest expected reward under context
$j$, and $k_j^*$ be the best action for context $j$, i.e., $u_j^* = \max_{k \in \mathcal{A}} u_{j,k}$ and $k_j^* = \arg\max_{k \in \mathcal{A}} u_{j,k}$.
For ease of exposition, we assume that the best action under each context is unique, i.e., $u_{j,k} < u_j^*$
for all $j$ and $k \ne k_j^*$. Similarly, we also assume $u_1^* > u_2^* > \cdots > u_J^*$ for simplicity.

With the knowledge of the $u_{j,k}$'s, the agent knows the best action $k_j^*$ and its expected reward $u_j^*$ under
any context $j$. In each round $t$, the task of the oracle is deciding whether to take action $k^*_{X_t}$ or not
depending on the remaining time $\tau = T - t + 1$ and the remaining budget $b_\tau$.

The special case of two-context systems (J = 2) is trivial, where the agent just needs to procrastinate 
for the better context (see Appendix of the supplementary material). When considering more 
general cases with J > 2, however, it is computationally intractable to exactly characterize the 
oracle solution. Therefore, we resort to approximations based on linear programming (LP). 

3.1 Upper Bound: Static Linear Programming 

We propose an upper bound for the expected total reward $U^*(T,B)$ of the oracle by relaxing the
hard constraint to an average constraint and solving the corresponding constrained LP problem.
Specifically, let $p_j \in [0,1]$ be the probability that the agent takes action $k_j^*$ for context $j$, and $1 - p_j$
be the probability that the agent skips context $j$ (i.e., taking action $A_t = 0$). Denote the probability
vector as $p = (p_1, p_2, \ldots, p_J)$. For a time-horizon $T$ and budget $B$, consider the following LP
problem:

$$
\begin{aligned}
(\mathcal{LP}_{T,B}) \quad \text{maximize}_{p} \quad & \sum_{j=1}^{J} p_j \pi_j u_j^*, && (1) \\
\text{subject to} \quad & \sum_{j=1}^{J} p_j \pi_j \le B/T, && (2) \\
& p \in [0,1]^J.
\end{aligned}
$$



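Define the following threshold as a function of the average budget $\rho = B/T$:

$$\tilde{j}(\rho) = \max\Big\{ j : \sum_{j'=1}^{j} \pi_{j'} \le \rho \Big\} \qquad (3)$$

with the convention that $\tilde{j}(\rho) = 0$ if $\pi_1 > \rho$. We can verify that the following solution is optimal for
$\mathcal{LP}_{T,B}$:

$$
p_j(\rho) =
\begin{cases}
1, & \text{if } 1 \le j \le \tilde{j}(\rho), \\
\dfrac{\rho - \sum_{j'=1}^{\tilde{j}(\rho)} \pi_{j'}}{\pi_{\tilde{j}(\rho)+1}}, & \text{if } j = \tilde{j}(\rho) + 1, \\
0, & \text{if } j > \tilde{j}(\rho) + 1.
\end{cases}
\qquad (4)
$$

Correspondingly, the optimal value of $\mathcal{LP}_{T,B}$ is

$$v(\rho) = \sum_{j=1}^{\tilde{j}(\rho)} \pi_j u_j^* + p_{\tilde{j}(\rho)+1}(\rho)\, \pi_{\tilde{j}(\rho)+1}\, u^*_{\tilde{j}(\rho)+1}. \qquad (5)$$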

This optimal value $v(\rho)$ can be viewed as the maximum expected reward in a single round with
average budget $\rho$. Summing over the entire horizon, the total expected reward becomes
$\widehat{U}(T,B) = T v(\rho)$, which is an upper bound of $U^*(T,B)$.

Lemma 1. For a unit-cost system with known statistics, if the time-horizon is $T$ and the budget is
$B$, then $\widehat{U}(T,B) \ge U^*(T,B)$.

The proof of Lemma 1 is available in Appendix A of the supplementary material. With Lemma 1, we
can bound the regret of any algorithm by comparing its performance with the upper bound $\widehat{U}(T,B)$
instead of $U^*(T,B)$. Since $\widehat{U}(T,B)$ has a simple expression, as we will see later, it significantly
reduces the complexity of regret analysis.
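
As an illustration of the threshold structure described in this subsection, the following minimal Python sketch computes the LP solution and its value for a unit-cost system. It assumes the contexts are already indexed in decreasing order of their best expected reward; the function name `solve_static_lp` and the numerical values are hypothetical and chosen only for the example.

```python
def solve_static_lp(pi, u_star, rho):
    """Threshold solution of the unit-cost LP with average budget rho.

    pi[j]     : probability of context j (contexts sorted so that u_star decreases)
    u_star[j] : best expected reward under context j
    Returns (p, v): the probability vector and the single-round LP value.
    """
    J = len(pi)
    p = [0.0] * J
    remaining = rho                      # average budget not yet assigned
    for j in range(J):
        if remaining >= pi[j]:
            p[j] = 1.0                   # contexts up to the threshold are always served
            remaining -= pi[j]
        else:
            p[j] = remaining / pi[j]     # the next context is served only partially
            break                        # all later contexts keep probability 0
    v = sum(p[j] * pi[j] * u_star[j] for j in range(J))
    return p, v


# Hypothetical three-context example with average budget rho = 0.4.
pi = [0.2, 0.3, 0.5]
u_star = [0.9, 0.6, 0.3]
p, v = solve_static_lp(pi, u_star, rho=0.4)
print(p, v)
```

In this example the first context is always served, the second is served with probability 0.2/0.3, and the third is always skipped, so the average budget consumed per round equals 0.4.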

3.2 Adaptive Linear Programming

Although the solution (4) provides an upper bound on the expected reward, using such a fixed
algorithm will not achieve good performance as the ratio $b_\tau/\tau$, referred to as the average remaining
budget, fluctuates over time. We propose an Adaptive Linear Programming (ALP) algorithm that
adjusts the threshold and randomization probability according to the instantaneous value of $b_\tau/\tau$.

Specifically, when the remaining time is $\tau$ and the remaining budget is $b_\tau = b$, we consider an LP
problem $\mathcal{LP}_{\tau,b}$, which is the same as $\mathcal{LP}_{T,B}$ except that $B/T$ in Eq. (2) is replaced with $b/\tau$. Then,
the optimal solution for $\mathcal{LP}_{\tau,b}$ can be obtained by replacing $\rho$ in Eqs. (3), (4), and (5) with $b/\tau$. The
ALP algorithm then makes decisions based on this optimal solution.

ALP Algorithm: At each round $t$ with remaining budget $b_\tau = b$, obtain the $p_j(b/\tau)$'s by solving $\mathcal{LP}_{\tau,b}$;
take action $A_t = k^*_{X_t}$ with probability $p_{X_t}(b/\tau)$, and $A_t = 0$ with probability $1 - p_{X_t}(b/\tau)$.
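
The ALP rule can be sketched as a short simulation. This is only an illustrative sketch under stated assumptions: rewards are taken to be Bernoulli with mean $u_j^*$ (the paper only requires rewards in $[0,1]$), contexts are indexed in decreasing order of $u_j^*$, and the names `alp_probabilities` and `run_alp` as well as all numerical values are hypothetical.

```python
import random


def alp_probabilities(pi, avg_budget):
    """Threshold solution of the LP with average budget b/tau (contexts sorted by u*_j)."""
    p, remaining = [0.0] * len(pi), avg_budget
    for j, pj in enumerate(pi):
        if remaining >= pj:
            p[j], remaining = 1.0, remaining - pj
        else:
            p[j] = max(remaining, 0.0) / pj
            break
    return p


def run_alp(pi, u_star, T, B, rng):
    """Simulate one run of ALP in a unit-cost system; returns (total reward, leftover budget)."""
    b, total = B, 0.0
    for t in range(1, T + 1):
        tau = T - t + 1                                   # remaining time
        if b <= 0:
            break                                         # budget exhausted
        x = rng.choices(range(len(pi)), weights=pi)[0]    # random context arrival
        p = alp_probabilities(pi, b / tau)                # re-solve with b/tau
        if rng.random() < p[x]:
            total += 1.0 if rng.random() < u_star[x] else 0.0   # Bernoulli reward (assumption)
            b -= 1                                        # unit cost
    return total, b


total, leftover = run_alp([0.25, 0.25, 0.5], [0.9, 0.6, 0.3], T=1000, B=400, rng=random.Random(0))
print(total, leftover)
```

Because the probabilities are recomputed from $b/\tau$ in every round, the budget is consumed with probability $b/\tau$ per round, and it is fully spent by the end of the horizon in this sketch.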

The above ALP algorithm only requires the ordering of the expected rewards instead of their accurate
values. This highly desirable feature allows us to combine ALP with classic MAB algorithms such as
UCB for the case without knowledge of expected rewards. Moreover, this simple ALP algorithm
achieves very good performance within a constant distance from the optimum, i.e., $O(1)$ regret,
except for certain boundary cases. Specifically, for $1 \le j \le J$, let $q_j$ be the cumulative probability
defined as $q_j = \sum_{j'=1}^{j} \pi_{j'}$, with the convention that $q_0 = 0$. The following theorem states the near
optimality of ALP.

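Theorem 1. Given any fixed $\rho \in (0,1)$, the regret of ALP satisfies:

1) (Non-boundary cases) if $\rho \ne q_j$ for any $j \in \{1, 2, \ldots, J-1\}$, then $R_{\mathrm{ALP}}(T,B)$ is bounded by a
constant that is independent of $T$ and depends on $\delta = \min\{\rho - q_{\tilde{j}(\rho)},\ q_{\tilde{j}(\rho)+1} - \rho\}$.

2) (Boundary cases) if $\rho = q_j$ for some $j \in \{1, 2, \ldots, J-1\}$, then $R_{\mathrm{ALP}}(T,B)$ is bounded by a
$\sqrt{T}$ term with coefficient $2(u_1^* - u_J^*)\sqrt{\rho(1-\rho)}$ plus a constant that is independent of $T$ and
depends on $\delta' = \min\{\rho - q_{\tilde{j}(\rho)-1},\ q_{\tilde{j}(\rho)+1} - \rho\}$.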

Theorem 1 shows that ALP achieves $O(1)$ regret except for certain boundary cases, where it still
achieves $O(\sqrt{T})$ regret. This implies that the regret due to the linear relaxation is negligible in most
cases. Thus, when the expected rewards are unknown, we can achieve low regret, e.g., logarithmic
regret, by combining ALP with appropriate information-acquisition mechanisms.

Sketch of Proof: Although the ALP algorithm seems fairly intuitive, its regret analysis is non-trivial.
The key to the proof is to analyze the evolution of the remaining budget $b_\tau$ by mapping
ALP to “sampling without replacement”. Specifically, from Eq. (4), we can verify that when the
remaining time is $\tau$ and the remaining budget is $b_\tau = b$, the system consumes one unit of budget with
probability $b/\tau$, and consumes nothing with probability $1 - b/\tau$. When considering the remaining
budget, the ALP algorithm can be viewed as “sampling without replacement”. Thus, we can show
that $b_\tau$ follows the hypergeometric distribution [23] and has the following properties:

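Lemma 2. Under the ALP algorithm, the remaining budget $b_\tau$ satisfies:

1) The expectation and variance of $b_\tau$ are $E[b_\tau] = \rho\tau$ and $\mathrm{Var}(b_\tau) = \frac{T-\tau}{T-1}\,\tau\rho(1-\rho)$, respectively.

2) For any positive number $\delta$ satisfying $0 < \delta < \min\{\rho, 1-\rho\}$, the tail distribution of $b_\tau$ satisfies
$P\{b_\tau < (\rho - \delta)\tau\} \le e^{-2\delta^2\tau}$ and $P\{b_\tau > (\rho + \delta)\tau\} \le e^{-2\delta^2\tau}$.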

Then, we prove Theorem 1 based on Lemma 2. Note that the expected total reward under ALP
is $U_{\mathrm{ALP}}(T,B) = E\big[\sum_{\tau=1}^{T} v(b_\tau/\tau)\big]$, where $v(\cdot)$ is defined in (5) and the expectation is taken
over the distribution of $b_\tau$. For the non-boundary cases, the single-round expected reward satisfies
$E[v(b_\tau/\tau)] = v(\rho)$ if the threshold $\tilde{j}(b_\tau/\tau) = \tilde{j}(\rho)$ for all possible $b_\tau$'s. The regret then is bounded
by a constant because the probability of the event $\tilde{j}(b_\tau/\tau) \ne \tilde{j}(\rho)$ decays exponentially due to the
concentration property of $b_\tau$. For the boundary cases, we show the conclusion by relating the regret
to the variance of $b_\tau$. Please refer to Appendix B of the supplementary material for details.
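
The sampling-without-replacement view used in the proof sketch can be checked numerically. The following sketch computes the exact distribution of the remaining budget when one unit is consumed with probability $b/\tau$ at remaining time $\tau$, and compares its mean and variance with the standard hypergeometric moments. The function name `remaining_budget_distribution` and the small values of $T$, $B$, and $\tau$ are hypothetical choices for illustration.

```python
def remaining_budget_distribution(T, B, tau):
    """Exact law of the remaining budget at remaining time tau.

    At remaining time r with budget b, one unit is spent with probability b / r,
    which is the same as drawing one ball from an urn with r balls, b of them black.
    """
    dist = {B: 1.0}
    for r in range(T, tau, -1):            # rounds played while remaining time is T, ..., tau + 1
        nxt = {}
        for b, prob in dist.items():
            spend = b / r
            if spend > 0:
                nxt[b - 1] = nxt.get(b - 1, 0.0) + prob * spend
            if spend < 1:
                nxt[b] = nxt.get(b, 0.0) + prob * (1 - spend)
        dist = nxt
    return dist


T, B, tau = 20, 8, 12
rho = B / T
dist = remaining_budget_distribution(T, B, tau)
mean = sum(b * p for b, p in dist.items())
var = sum((b - mean) ** 2 * p for b, p in dist.items())
print(mean, rho * tau)                                     # empirical vs. hypergeometric mean
print(var, (T - tau) / (T - 1) * tau * rho * (1 - rho))    # empirical vs. hypergeometric variance
```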

4 UCB-ALP Algorithm for Constrained Contextual Bandits 

Now we return to constrained contextual bandits, where the expected rewards are unknown
to the agent. We assume the agent knows the context distribution as in [17], which will be relaxed in
Section 5. Thanks to the desirable properties of ALP, the maxim of “optimism under uncertainty”
is still applicable and ALP can be extended to the bandit settings when combined with estimation
policies that can quickly provide correct ranking with high probability. Here, combining ALP with
the UCB method, we propose a UCB-ALP algorithm for constrained contextual bandits.


4.1 UCB: Notations and Property 


Let $C_{j,k}(t)$ be the number of times that action $k \in \mathcal{A}$ has been taken under context $j$ up to round $t$.
If $C_{j,k}(t-1) > 0$, let $\bar{u}_{j,k}(t)$ be the empirical reward of action $k$ under context $j$, i.e.,
$\bar{u}_{j,k}(t) = \frac{1}{C_{j,k}(t-1)} \sum_{t'=1}^{t-1} Y_{t'}\, \mathbb{1}(X_{t'} = j, A_{t'} = k)$, where $\mathbb{1}(\cdot)$ is the indicator function. We define the UCB
of $u_{j,k}$ at $t$ as $\hat{u}_{j,k}(t) = \bar{u}_{j,k}(t) + \sqrt{\frac{\log t}{2 C_{j,k}(t-1)}}$ for $C_{j,k}(t-1) > 0$, and $\hat{u}_{j,k}(t) = 1$ for
$C_{j,k}(t-1) = 0$. Furthermore, we define the UCB of the maximum expected reward under context $j$ as
$\hat{u}_j^*(t) = \max_{k \in \mathcal{A}} \hat{u}_{j,k}(t)$. As suggested in [24], we use a smaller coefficient in the exploration term
$\sqrt{\frac{\log t}{2 C_{j,k}(t-1)}}$ than the traditional UCB algorithm to achieve better performance.


We present the following property of UCB that is important in regret analysis. 

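Lemma 3. For two context-action pairs, $(j,k)$ and $(j',k')$, if $u_{j,k} < u_{j',k'}$, then for any $t \le T$,

$$P\{\hat{u}_{j,k}(t) \ge \hat{u}_{j',k'}(t) \mid C_{j,k}(t-1) \ge \ell_{j,k}\} \le 2t^{-1}, \qquad (6)$$

where $\ell_{j,k} = \dfrac{2 \log T}{(u_{j',k'} - u_{j,k})^2}$.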


Lemma 3 states that for two context-action pairs, the ordering of their expected rewards can be identified
correctly with high probability, as long as the suboptimal pair has been executed sufficiently
many times (on the order of $O(\log T)$). This property has been widely applied in the analysis of UCB-based
algorithms, and its proof can be found in [13, 25] with a minor modification of the
coefficients.


4.2 UCB-ALP Algorithm 

We propose a UCB-based adaptive linear programming (UCB-ALP) algorithm, as shown in Algorithm 1.
As indicated by the name, the UCB-ALP algorithm maintains UCB estimates of expected
rewards for all context-action pairs and then implements the ALP algorithm based on these estimates.
Note that the UCB estimates $\hat{u}_j^*(t)$'s may be non-decreasing in $j$. Thus, the solution of
$\mathcal{LP}_{\tau,b}$ based on $\hat{u}_j^*(t)$ depends on the actual ordering of the $\hat{u}_j^*(t)$'s and may be different from Eq. (4).
We use $\tilde{p}_j(\cdot)$ rather than $p_j(\cdot)$ to indicate this difference.


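Algorithm 1 UCB-ALP

Input: Time-horizon $T$, budget $B$, and context distribution $\pi_j$'s;

Init: $\tau = T$, $b = B$;
      $C_{j,k}(0) = 0$, $\bar{u}_{j,k}(0) = 0$, $\hat{u}_{j,k}(0) = 1$, $\forall j \in \mathcal{X}$ and $\forall k \in \mathcal{A}$; $\hat{u}_j^*(0) = 1$, $\forall j \in \mathcal{X}$;

for $t = 1$ to $T$ do
    $k_j^*(t) \leftarrow \arg\max_k \hat{u}_{j,k}(t)$, $\forall j$;
    if $b > 0$ then
        Obtain the probabilities $\tilde{p}_j(b/\tau)$'s by solving $\mathcal{LP}_{\tau,b}$ with $u_j^*$ replaced by $\hat{u}_j^*(t)$;
        Take action $k^*_{X_t}(t)$ with probability $\tilde{p}_{X_t}(b/\tau)$;
    end if
    Update $\tau$, $b$, $C_{j,k}(t)$, $\bar{u}_{j,k}(t)$, and $\hat{u}_{j,k}(t)$.
end for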
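
A runnable sketch of Algorithm 1 is given below for illustration. It is not the authors' implementation: rewards are assumed to be Bernoulli, costs are unit, the LP is solved greedily by serving contexts in decreasing order of their current UCBs, and the names `lp_probabilities` and `ucb_alp` together with all numerical values are hypothetical.

```python
import math
import random


def lp_probabilities(pi, scores, avg_budget):
    """Greedy LP solution: serve contexts in decreasing order of their scores."""
    p, remaining = [0.0] * len(pi), avg_budget
    for j in sorted(range(len(pi)), key=lambda j: -scores[j]):
        if remaining >= pi[j]:
            p[j], remaining = 1.0, remaining - pi[j]
        else:
            p[j] = max(remaining, 0.0) / pi[j]
            break
    return p


def ucb_alp(pi, u, T, B, rng):
    """One run of UCB-ALP with unit costs; u[j][k] are Bernoulli means (simulation assumption)."""
    J, K = len(u), len(u[0])
    count = [[0] * K for _ in range(J)]        # C_{j,k}: times action k was taken under context j
    total = [[0.0] * K for _ in range(J)]      # summed rewards of each context-action pair
    b, reward = B, 0.0
    for t in range(1, T + 1):
        tau = T - t + 1                        # remaining time
        if b <= 0:
            break                              # nothing more can be done without budget
        # UCB of every pair: empirical mean plus sqrt(log t / (2 C)); 1 if never tried.
        ucb = [[1.0 if count[j][k] == 0 else
                total[j][k] / count[j][k] + math.sqrt(math.log(t) / (2 * count[j][k]))
                for k in range(K)] for j in range(J)]
        best = [max(range(K), key=lambda k: ucb[j][k]) for j in range(J)]
        u_hat_star = [ucb[j][best[j]] for j in range(J)]
        x = rng.choices(range(J), weights=pi)[0]           # observe the context
        p = lp_probabilities(pi, u_hat_star, b / tau)      # ALP step with UCBs as rewards
        if rng.random() < p[x]:
            k = best[x]
            y = 1.0 if rng.random() < u[x][k] else 0.0
            count[x][k] += 1
            total[x][k] += y
            reward += y
            b -= 1                                         # unit cost
    return reward, b, count


u = [[0.9, 0.5], [0.6, 0.2], [0.3, 0.1]]
reward, leftover, count = ucb_alp([0.25, 0.25, 0.5], u, T=2000, B=800, rng=random.Random(1))
print(reward, leftover, count)
```

Recomputing all UCBs in every round keeps the sketch short; an incremental update of only the pair that was played would be the more efficient choice in practice.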


4.3 Regret of UCB-ALP 

We study the regret of UCB-ALP in this section. Due to space limitations, we only present a sketch
of the analysis. Specific representations of the regret bounds and proof details can be found in the
supplementary material.

Recall that $q_j = \sum_{j'=1}^{j} \pi_{j'}$ $(1 \le j \le J)$ are the boundaries defined in Section 3.2. We show that
as the budget $B$ and the time-horizon $T$ grow to infinity in proportion, the proposed UCB-ALP
algorithm achieves logarithmic regret except for the boundary cases.

Theorem 2. Given the $\pi_j$'s, $u_{j,k}$'s and a fixed $\rho \in (0,1)$, the regret of UCB-ALP satisfies:

1) (Non-boundary cases) if $\rho \ne q_j$ for any $j \in \{1, 2, \ldots, J-1\}$, then the regret of UCB-ALP is
$R_{\text{UCB-ALP}}(T,B) = O(JK \log T)$.

2) (Boundary cases) if $\rho = q_j$ for some $j \in \{1, 2, \ldots, J-1\}$, then the regret of UCB-ALP is
$R_{\text{UCB-ALP}}(T,B) = O(\sqrt{T} + JK \log T)$.

Theorem 2 differs from Theorem 1 by an additional term $O(JK \log T)$. This term results from using
UCB to learn the ordering of expected rewards. Under UCB, each of the $JK$ context-action pairs
should be executed roughly $O(\log T)$ times to obtain the correct ordering. For the non-boundary
cases, UCB-ALP is order-optimal because obtaining the correct action ranking under each context
will result in $O(\log T)$ regret. Note that our results do not contradict the lower bound in [17]
because we consider discrete contexts and actions, and focus on instance-dependent regret. For
the boundary cases, we keep both the $\sqrt{T}$ and $\log T$ terms because the constant in the $\log T$ term
is typically much larger than that in the $\sqrt{T}$ term. Therefore, the $\log T$ term may dominate the
regret, particularly when the number of context-action pairs is large and $T$ is of medium size. It is still an open
problem whether one can achieve regret lower than $O(\sqrt{T})$ in these cases.

Sketch of Proof: We bound the regret of UCB-ALP by comparing its performance with the benchmark
$\widehat{U}(T,B)$. The analysis of this bound is challenging due to the close interactions among different
sources of regret and the randomness of context arrivals. We first partition the regret according
to the sources and then bound each part of regret, respectively.

Step 1: Partition the regret. By analyzing the implementation of UCB-ALP, we show that its
regret is bounded as

$$R_{\text{UCB-ALP}}(T,B) \le R^{(a)}_{\text{UCB-ALP}}(T,B) + R^{(c)}_{\text{UCB-ALP}}(T,B),$$

where the first part $R^{(a)}_{\text{UCB-ALP}}(T,B) = \sum_{j=1}^{J} \sum_{k=1}^{K} (u_j^* - u_{j,k})\, E[C_{j,k}(T)]$ is the regret from
action ranking errors within a context, and the second part
$R^{(c)}_{\text{UCB-ALP}}(T,B) = \sum_{\tau=1}^{T} E\big[v(\rho) - \sum_{j=1}^{J} \tilde{p}_j(b_\tau/\tau)\, \pi_j u_j^*\big]$ is the regret from the fluctuations of $b_\tau$ and context ranking errors.

Step 2: Bound each part of regret. For the first part, we can show that $R^{(a)}_{\text{UCB-ALP}}(T,B) = O(\log T)$
using techniques similar to those for traditional UCB methods [25]. The major challenge of regret
analysis for UCB-ALP then lies in the evaluation of the second part $R^{(c)}_{\text{UCB-ALP}}(T,B)$.

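We first verify that the evolution of $b_\tau$ under UCB-ALP is similar to that under ALP and Lemma 2
still holds under UCB-ALP. With respect to context ranking errors, we note that unlike classic UCB
methods, not all context ranking errors contribute to the regret due to the threshold structure of
ALP. Therefore, we carefully categorize the context ranking results based on their contributions. We
briefly discuss the analysis for the non-boundary cases here. Recall that $\tilde{j}(\rho)$ is the threshold for the
static LP problem $\mathcal{LP}_{T,B}$. We define the following events that capture all possible ranking results
based on UCBs:

$$
\begin{aligned}
\mathcal{E}_{\mathrm{rank},0}(t) &= \big\{\forall j \le \tilde{j}(\rho),\ \hat{u}_j^*(t) > \hat{u}^*_{\tilde{j}(\rho)+1}(t);\ \forall j > \tilde{j}(\rho)+1,\ \hat{u}_j^*(t) < \hat{u}^*_{\tilde{j}(\rho)+1}(t)\big\}, \\
\mathcal{E}_{\mathrm{rank},1}(t) &= \big\{\exists j \le \tilde{j}(\rho),\ \hat{u}_j^*(t) \le \hat{u}^*_{\tilde{j}(\rho)+1}(t);\ \forall j > \tilde{j}(\rho)+1,\ \hat{u}_j^*(t) < \hat{u}^*_{\tilde{j}(\rho)+1}(t)\big\}, \\
\mathcal{E}_{\mathrm{rank},2}(t) &= \big\{\exists j > \tilde{j}(\rho)+1,\ \hat{u}_j^*(t) \ge \hat{u}^*_{\tilde{j}(\rho)+1}(t)\big\}.
\end{aligned}
$$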

The first event $\mathcal{E}_{\mathrm{rank},0}(t)$ indicates a roughly correct context ranking, because under $\mathcal{E}_{\mathrm{rank},0}(t)$ UCB-ALP
obtains a correct solution for $\mathcal{LP}_{\tau,b_\tau}$ if $b_\tau/\tau \in [q_{\tilde{j}(\rho)}, q_{\tilde{j}(\rho)+1}]$. The last two events $\mathcal{E}_{\mathrm{rank},s}(t)$,
$s = 1, 2$, represent two types of context ranking errors: $\mathcal{E}_{\mathrm{rank},1}(t)$ corresponds to “certain contexts
with above-threshold reward having lower UCB”, while $\mathcal{E}_{\mathrm{rank},2}(t)$ corresponds to “certain contexts
with below-threshold reward having higher UCB”. Let $T^{(s)} = \sum_{t=1}^{T} \mathbb{1}(\mathcal{E}_{\mathrm{rank},s}(t))$ for $0 \le s \le 2$.
We can show that the expected number of context ranking errors satisfies $E[T^{(s)}] = O(JK \log T)$,
$s = 1, 2$, implying that $R^{(c)}_{\text{UCB-ALP}}(T,B) = O(JK \log T)$. Summarizing the two parts, we have
$R_{\text{UCB-ALP}}(T,B) = O(JK \log T)$ for the non-boundary cases. The regret for the boundary cases
can be bounded using similar arguments.
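
The three ranking outcomes described above (roughly correct ranking, an above-threshold context with a lower UCB, and a below-threshold context with a higher UCB) can be made concrete with a small classifier. This is a hedged sketch: the function name `ranking_event`, the 0-based list layout, and the example values are hypothetical, and ties are assigned to the error events as an assumption.

```python
def ranking_event(ucb_star, threshold):
    """Classify the context ranking induced by the UCBs.

    ucb_star[j - 1] is the UCB of context j, with contexts indexed by their true
    reward order; threshold is the static LP threshold, with 0 <= threshold < J.
    Returns 0 (roughly correct ranking), 1, or 2 (the two error types).
    """
    pivot = ucb_star[threshold]                       # UCB of the context right after the threshold
    if any(v >= pivot for v in ucb_star[threshold + 1:]):
        return 2    # a below-threshold context looks at least as good as the pivot
    if any(v <= pivot for v in ucb_star[:threshold]):
        return 1    # an above-threshold context looks no better than the pivot
    return 0


print(ranking_event([0.9, 0.7, 0.5, 0.3], threshold=1))
print(ranking_event([0.6, 0.7, 0.5, 0.3], threshold=1))
print(ranking_event([0.9, 0.7, 0.5, 0.8], threshold=1))
```

The three calls illustrate, in order, a roughly correct ranking, an error of the first type, and an error of the second type.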

Key Insights from UCB-ALP: Constrained contextual bandits involve complicated interactions
between information acquisition and decision making. UCB-ALP alleviates these interactions by
approximating the oracle with ALP for decision making. This approximation achieves near-optimal
performance while tolerating certain estimation errors of system statistics, and thus enables the
combination with estimation methods such as UCB in unknown statistics cases. Moreover, the
adaptation property of UCB-ALP guarantees the concentration property of the system status, e.g.,
$b_\tau/\tau$. This allows us to separately study the impact of action or context ranking errors and conduct
rigorous analysis of regret. These insights can be applied in algorithm design and analysis for
constrained contextual bandits under more general settings.


5 Bandits with Unknown Context Distribution 

When the context distribution is unknown, a reasonable heuristic is to replace the probability $\pi_j$ in
ALP with its empirical estimate, i.e., $\hat{\pi}_j(t) = \frac{1}{t} \sum_{t'=1}^{t} \mathbb{1}(X_{t'} = j)$. We refer to this modified ALP
algorithm as Empirical ALP (EALP), and its combination with UCB as UCB-EALP.
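
The empirical estimate used by EALP is simply the fraction of rounds in which each context has been observed so far. A minimal sketch follows; the function name `empirical_context_distribution` and the sample sequence are hypothetical, and at least one observation is assumed.

```python
from collections import Counter


def empirical_context_distribution(contexts, num_contexts):
    """Fraction of observed rounds in which each context 1..J appeared.

    contexts     : list of contexts observed so far (values in 1..num_contexts), non-empty
    num_contexts : J, the number of contexts
    """
    counts = Counter(contexts)
    t = len(contexts)
    return [counts[j] / t for j in range(1, num_contexts + 1)]


# Hypothetical sequence of five observed contexts with J = 3.
print(empirical_context_distribution([1, 2, 2, 3, 2], num_contexts=3))
```

In EALP these estimates would take the place of the true context probabilities when the LP is solved in each round.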

The empirical distribution provides a maximum likelihood estimate of the context distribution, and
the EALP and UCB-EALP algorithms achieve performance similar to that of ALP and UCB-ALP,
respectively, as observed in numerical simulations. However, a rigorous analysis for EALP and
UCB-EALP is much more challenging due to the dependency introduced by the empirical distribution. To
tackle this issue, our rigorous analysis focuses on a truncated version of EALP where we stop updating
the empirical distribution after a given round. Using the method of bounded averaged differences
based on a coupling argument, we obtain the concentration property of the average remaining budget
$b_\tau/\tau$, and show that this truncated EALP algorithm achieves $O(1)$ regret except for the boundary
cases. The regret of the corresponding UCB-based version can be bounded similarly to that of UCB-ALP.


6 Bandits with Heterogeneous Costs 

The insights obtained from unit-cost systems can also be used to design algorithms for heterogeneous
cost systems where the cost $c_{j,k}$ depends on $j$ and $k$. We generalize the ALP algorithm to
approximate the oracle, and adjust it to the case with unknown expected rewards. For simplicity, we
assume the context distribution is known here, while the empirical estimate can be used to replace
the actual context distribution if it is unknown, as discussed in the previous section.

With heterogeneous costs, the quality of an action $k$ under a context $j$ is roughly captured by its
normalized expected reward, defined as $\eta_{j,k} = u_{j,k}/c_{j,k}$. However, the agent cannot only focus
on the “best” action, i.e., $k_j^* = \arg\max_{k \in \mathcal{A}} \eta_{j,k}$, for context $j$. This is because there may exist
another action $k'$ such that $\eta_{j,k'} < \eta_{j,k_j^*}$, but $u_{j,k'} > u_{j,k_j^*}$ (and of course, $c_{j,k'} > c_{j,k_j^*}$). If
the budget allocated to context $j$ is sufficient, then the agent may take action $k'$ to maximize the
expected reward. Therefore, to approximate the oracle, the ALP algorithm in this case needs to
solve an LP problem accounting for all context-action pairs with an additional constraint that only
one action can be taken under each context. By investigating the structure of ALP in this case and
the concentration of the remaining budget, we show that ALP achieves $O(1)$ regret in non-boundary
cases, and $O(\sqrt{T})$ regret in boundary cases. Then, an $\epsilon$-First ALP algorithm is proposed for the
unknown statistics case where an exploration stage is implemented first and then an exploitation
stage is implemented according to ALP.
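
To see why the action with the highest normalized reward is not always the right choice, consider a single context with two actions. The sketch below compares the per-round expected reward of using only one action as often as the average budget allows. The function name `single_action_value` and the numbers (a cheap action with reward 0.3 and cost 1, a costly action with reward 0.5 and cost 2) are hypothetical and serve only as an illustration.

```python
def single_action_value(u, c, rho):
    """Expected per-round reward when only one action (reward u, cost c) is used,
    as often as the average budget rho per round allows."""
    return u * min(1.0, rho / c)


cheap = (0.3, 1.0)     # normalized reward 0.30
costly = (0.5, 2.0)    # normalized reward 0.25, but higher raw reward

for rho in (0.5, 2.0):
    print(rho, single_action_value(*cheap, rho), single_action_value(*costly, rho))
```

With a scarce budget (0.5 per round) the cheap action with the higher normalized reward is preferable, whereas with an ample budget (2 per round) the costly action yields the larger expected reward, which is why the LP must account for all context-action pairs.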


7 Conclusion 

In this paper, we study computationally efficient algorithms that achieve logarithmic or sublinear
regret for constrained contextual bandits. Under simplified yet practical assumptions, we show
that the close interactions between information acquisition and decision making in constrained
contextual bandits can be decoupled by adaptive linear relaxation. When the system statistics are
known, the ALP approximation achieves near-optimal performance, while tolerating certain estimation
errors of system parameters. When the expected rewards are unknown, the proposed UCB-ALP
algorithm leverages the advantages of ALP and UCB, and achieves $O(\log T)$ regret except for certain
boundary cases, where it achieves $O(\sqrt{T})$ regret. Our study provides an efficient approach to
dealing with the challenges introduced by budget constraints and could potentially be extended to
more general constrained contextual bandits.
Appendices 

A Proof of Lemma 1: Upper Bound 

We prove Lemma 1 by comparing $\widehat{U}(T, B)$ with the expected total reward under any feasible algorithm satisfying the budget constraint. 

Let $C_j$ be the number of rounds in which an action is taken under context $j$, for any realization under any feasible algorithm with known statistics. Let $p_j = \mathbb{E}[C_j]/(\pi_j T)$, which satisfies $0 \le p_j \le 1$. Then the expected total reward is at most $\sum_{j=1}^{J} u_j^* \mathbb{E}[C_j] = T \sum_{j=1}^{J} p_j \pi_j u_j^*$. Further, because the hard budget constraint is met for all realizations, i.e., $\sum_{j=1}^{J} C_j \le B$, we have $\sum_{j=1}^{J} \pi_j p_j = \sum_{j=1}^{J} \mathbb{E}[C_j]/T \le B/T$. Hence $(p_1, \ldots, p_J)$ is feasible for the linear program that defines $v(\rho)$, and the expected total reward obtained by any feasible algorithm, including the oracle algorithm, is upper bounded by $\widehat{U}(T, B)$. 
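
As a rough numerical companion to this argument, the sketch below evaluates the single-round LP value $v(\rho)$ greedily and the resulting benchmark $T v(\rho)$. It assumes contexts are already sorted by decreasing $u_j^*$; the function names `static_lp_value` and `upper_bound` are ours and do not come from the paper.

```python
def static_lp_value(pi, u_star, rho):
    """Single-round LP value v(rho); contexts are assumed sorted by decreasing u_star."""
    value, remaining = 0.0, rho
    for prob, reward in zip(pi, u_star):
        served = min(prob, remaining)  # probability mass of this context that is served
        value += served * reward
        remaining -= served
        if remaining <= 0:
            break
    return value


def upper_bound(pi, u_star, T, B):
    """Benchmark T * v(B / T) used as an upper bound on the oracle's reward."""
    return T * static_lp_value(pi, u_star, B / T)


# Three contexts, average budget 0.4: serve context 1 fully and two thirds of context 2.
example_value = static_lp_value([0.2, 0.3, 0.5], [0.9, 0.6, 0.3], 0.4)
```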

B Proof of Theorem 1: Near Optimality of ALP 

B.1 Proof of Lemma 2: Evolution of Remaining Budget 

The evolution of the remaining budget is critical for evaluating the expected total reward under ALP. We prove Lemma 2 by casting ALP as a problem of sampling without replacement. 

From the implementation of ALP, we can verify that when the remaining time is $\tau$ and the remaining budget is $b_\tau = b$, the system consumes one unit of budget with probability $b/\tau$, and consumes nothing with probability $1 - b/\tau$. Thus, when focusing on the remaining budget, we can view the ALP algorithm as a sampling problem without replacement as follows. 

Mapping ALP to Sampling without Replacement: Consider $T$ balls in an urn, including $B$ black balls and $T - B$ white balls. Running ALP is equivalent to randomly drawing balls without replacement. Taking an action $A_t > 0$ is equivalent to drawing a black ball, and taking the dummy action $A_t = 0$ is equivalent to drawing a white ball. The event that $b_\tau = b$ is equivalent to the event that the agent draws $T - \tau$ balls, and the number of drawn black balls is $B - b$. 

Based on the above mapping and using its symmetric property, we know that $b_\tau$ follows the hypergeometric distribution [11], which completes the proof of Lemma 2. 
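
The urn mapping can be checked exactly on a small instance. The sketch below propagates the law of the remaining budget under the rule "spend one unit with probability $b/\tau$" and compares it with the hypergeometric probability mass function; the function names are ours, and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from math import comb


def alp_budget_distribution(T, B, tau):
    """Exact law of the remaining budget when remaining time is tau, starting from (T, B)."""
    dist = {B: Fraction(1)}
    for remaining_time in range(T, tau, -1):
        nxt = {}
        for b, prob in dist.items():
            spend = Fraction(b, remaining_time)  # probability of consuming one unit
            if spend > 0:
                nxt[b - 1] = nxt.get(b - 1, 0) + prob * spend
            if spend < 1:
                nxt[b] = nxt.get(b, 0) + prob * (1 - spend)
        dist = nxt
    return dist


def hypergeometric_pmf(T, B, tau, b):
    """P{b black balls are among the tau balls still in the urn}."""
    return Fraction(comb(B, b) * comb(T - B, tau - b), comb(T, tau))


dist = alp_budget_distribution(12, 5, 7)
matches = all(p == hypergeometric_pmf(12, 5, 7, b) for b, p in dist.items())
```

The mean of the computed law equals $\rho\tau$ with $\rho = B/T$, which is the property used repeatedly in the proofs that follow.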

B.2 Part 1: Non-Boundary Cases 

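According to Lemma 1, $\widehat{U}(T, B)$ is an upper bound on the total expected reward. Thus, 

$$U^*(T,B) - U_{\mathrm{ALP}}(T,B) \le \widehat{U}(T,B) - U_{\mathrm{ALP}}(T,B) = \sum_{\tau=1}^{T} \big\{ v(\rho) - \mathbb{E}[v(b_\tau/\tau)] \big\}. \tag{7}$$

To evaluate the gap between the single-round values, we define an auxiliary function $\tilde{v}(b/\tau)$ for a given $\rho$ as follows: 

$$\tilde{v}(b/\tau) = \sum_{j=1}^{\tilde{j}(\rho)} \pi_j u_j^* + \pi_{\tilde{j}(\rho)+1}\, \tilde{p}_{\tilde{j}(\rho)+1}(b/\tau)\, u^*_{\tilde{j}(\rho)+1}, \tag{8}$$

where 

$$\tilde{p}_{\tilde{j}(\rho)+1}(b/\tau) = \frac{b/\tau - \sum_{j=1}^{\tilde{j}(\rho)} \pi_j}{\pi_{\tilde{j}(\rho)+1}}.$$

This auxiliary function bridges the gap between the single-round values $v(\rho)$ and $\mathbb{E}[v(b_\tau/\tau)]$, as follows. 

First, we note that $\tilde{v}(b/\tau)$ uses the same threshold $\tilde{j}(\rho)$ as $v(\rho)$. The only difference between $\tilde{v}(b/\tau)$ and $v(\rho)$ is that $\tilde{v}(b/\tau)$ uses the instantaneous average budget $b/\tau$ instead of the fixed average budget $\rho$. Considering all possible $b$'s and according to Lemma 2, we have 

$$v(\rho) - \mathbb{E}[\tilde{v}(b_\tau/\tau)] = \{\rho - \mathbb{E}[b_\tau/\tau]\}\, u^*_{\tilde{j}(\rho)+1} = 0. \tag{9}$$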
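Second, compared with $v(b/\tau)$, the difference of the auxiliary function $\tilde{v}(b/\tau)$ comes from the event $\tilde{j}(b/\tau) \ne \tilde{j}(\rho)$, which only occurs when $b/\tau < q_{\tilde{j}(\rho)}$ or $b/\tau > q_{\tilde{j}(\rho)+1}$. Because $\rho \ne q_j$, $1 \le j \le J - 1$, there exists a positive number $\delta = \min\{\rho - q_{\tilde{j}(\rho)},\ q_{\tilde{j}(\rho)+1} - \rho\}$ such that for all $\rho - \delta < \rho' < \rho + \delta$, the threshold under $\rho'$ is the same as that under $\rho$, i.e., $\tilde{j}(\rho') = \tilde{j}(\rho)$. Therefore, for all $b$ satisfying $\rho - \delta < b/\tau < \rho + \delta$, $\tilde{v}(b/\tau) = v(b/\tau)$. 

If $b/\tau \le \rho - \delta$, then 

$$\begin{aligned}
\tilde{v}(b/\tau) - v(b/\tau)
&= \sum_{j=\tilde{j}(b/\tau)+1}^{\tilde{j}(\rho)} \pi_j u_j^* + (b/\tau - q_{\tilde{j}(\rho)})\, u^*_{\tilde{j}(\rho)+1} - (b/\tau - q_{\tilde{j}(b/\tau)})\, u^*_{\tilde{j}(b/\tau)+1} \\
&\le \sum_{j=\tilde{j}(b/\tau)+1}^{\tilde{j}(\rho)} \pi_j u_j^* + (b/\tau - q_{\tilde{j}(\rho)})\, u^*_{\tilde{j}(\rho)+1} - (b/\tau - q_{\tilde{j}(b/\tau)})\, u^*_{\tilde{j}(\rho)+1} \\
&\le (u_1^* - u^*_{\tilde{j}(\rho)+1}) \sum_{j=\tilde{j}(b/\tau)+1}^{\tilde{j}(\rho)} \pi_j \\
&\le q_{\tilde{j}(\rho)} (u_1^* - u_J^*).
\end{aligned} \tag{10}$$

Similarly, if $b/\tau \ge \rho + \delta$, then 

$$\begin{aligned}
\tilde{v}(b/\tau) - v(b/\tau)
&= (b/\tau - q_{\tilde{j}(\rho)})\, u^*_{\tilde{j}(\rho)+1} - \sum_{j=\tilde{j}(\rho)+1}^{\tilde{j}(b/\tau)} \pi_j u_j^* - (b/\tau - q_{\tilde{j}(b/\tau)})\, u^*_{\tilde{j}(b/\tau)+1} \\
&\le (b/\tau - q_{\tilde{j}(\rho)}) (u^*_{\tilde{j}(\rho)+1} - u^*_{\tilde{j}(b/\tau)+1}) \\
&\le (1 - q_{\tilde{j}(\rho)}) (u_1^* - u_J^*).
\end{aligned} \tag{11}$$

Summing all the above three cases ($\rho - \delta < b/\tau < \rho + \delta$, $b/\tau \le \rho - \delta$, and $b/\tau \ge \rho + \delta$) and using Eq. (9), we have 

$$\begin{aligned}
v(\rho) - \mathbb{E}[v(b_\tau/\tau)]
&= v(\rho) - \mathbb{E}[\tilde{v}(b_\tau/\tau)] + \mathbb{E}[\tilde{v}(b_\tau/\tau) - v(b_\tau/\tau)] \\
&= \sum_{b \le \tau(\rho-\delta)\ \text{or}\ b \ge \tau(\rho+\delta)} \mathbb{P}\{b_\tau = b\}\, [\tilde{v}(b/\tau) - v(b/\tau)] \\
&\le q_{\tilde{j}(\rho)} (u_1^* - u_J^*)\, \mathbb{P}\{b_\tau \le \tau(\rho - \delta)\} + (1 - q_{\tilde{j}(\rho)}) (u_1^* - u_J^*)\, \mathbb{P}\{b_\tau \ge \tau(\rho + \delta)\} \\
&\le (u_1^* - u_J^*)\, e^{-2\delta^2 \tau}.
\end{aligned} \tag{12}$$

Part 1 of Theorem 1 then follows by substituting Eq. (12) into Eq. (7). 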
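
The constant-regret claim for non-boundary cases can be sanity-checked numerically. The sketch below computes the gap in Eq. (7) exactly from the hypergeometric law of $b_\tau$ and compares it with the geometric-series bound $(u_1^* - u_J^*)/(1 - e^{-2\delta^2})$ implied by Eq. (12). The instance and the function names are hypothetical and chosen only for illustration.

```python
from math import comb, exp


def lp_value(pi, u_star, rho):
    """Single-round LP value v(rho) for contexts sorted by decreasing u_star."""
    value, remaining = 0.0, rho
    for prob, reward in zip(pi, u_star):
        served = min(prob, max(remaining, 0.0))
        value += served * reward
        remaining -= served
    return value


def alp_gap(pi, u_star, T, B):
    """Sum over tau of v(rho) - E[v(b_tau / tau)], with b_tau hypergeometric."""
    rho, gap = B / T, 0.0
    for tau in range(1, T + 1):
        expected = 0.0
        for b in range(max(0, B - (T - tau)), min(B, tau) + 1):
            pmf = comb(B, b) * comb(T - B, tau - b) / comb(T, tau)
            expected += pmf * lp_value(pi, u_star, b / tau)
        gap += lp_value(pi, u_star, rho) - expected
    return gap


pi, u_star = [0.2, 0.3, 0.5], [0.9, 0.6, 0.3]
delta = 0.15  # min(0.35 - 0.2, 0.5 - 0.35) for rho = 70 / 200
gap = alp_gap(pi, u_star, 200, 70)
bound = (u_star[0] - u_star[-1]) / (1 - exp(-2 * delta ** 2))
```

The computed gap is nonnegative because $v$ is concave and $\mathbb{E}[b_\tau/\tau] = \rho$, and it stays below the bound, which does not grow with $T$.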


B.3 Part 2: Boundary Cases 

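The proof of Part 2 of Theorem 1 is similar to that of Part 1. Specifically, when $\rho = q_{\tilde{j}(\rho)}$, let $\delta' = \min\{\rho - q_{\tilde{j}(\rho)-1},\ q_{\tilde{j}(\rho)+1} - \rho\}$. From the proof of Part 1, we know that 

$$v(\rho) - \mathbb{E}[\tilde{v}(b_\tau/\tau)] = 0. \tag{13}$$

In addition, if $\rho \le b/\tau < \rho + \delta'$, then $\tilde{j}(b/\tau) = \tilde{j}(\rho)$ and $\tilde{v}(b/\tau) = v(b/\tau)$. 

If $\rho - \delta' < b/\tau < \rho$, we have $\tilde{j}(b/\tau) = \tilde{j}(\rho) - 1$, and 

$$\begin{aligned}
\tilde{v}(b/\tau) - v(b/\tau)
&= (\pi_{\tilde{j}(\rho)} + q_{\tilde{j}(\rho)-1} - b/\tau)\, u^*_{\tilde{j}(\rho)} + (b/\tau - q_{\tilde{j}(\rho)})\, u^*_{\tilde{j}(\rho)+1} \\
&= (\rho - b/\tau)\, u^*_{\tilde{j}(\rho)} - (\rho - b/\tau)\, u^*_{\tilde{j}(\rho)+1} \\
&\le |b/\tau - \rho|\, (u_1^* - u_J^*).
\end{aligned} \tag{14}$$

Moreover, we still have (10) if $b/\tau \le \rho - \delta'$, and (11) if $b/\tau \ge \rho + \delta'$. 

Compared with the proof of Part 1, the only difference lies in the case $\rho - \delta' < b/\tau < \rho$. Thus, summing all the above cases and using the results in the analysis of Part 1, we have 

$$\mathbb{E}[\tilde{v}(b_\tau/\tau) - v(b_\tau/\tau)] \le (u_1^* - u_J^*) \big\{ \mathbb{E}[|b_\tau/\tau - \rho|] + e^{-2(\delta')^2 \tau} \big\}.$$

Consequently, 

$$\begin{aligned}
U^*(T,B) - U_{\mathrm{ALP}}(T,B)
&\le \widehat{U}(T,B) - U_{\mathrm{ALP}}(T,B) \\
&= \sum_{\tau=1}^{T} \{ v(\rho) - \mathbb{E}[v(b_\tau/\tau)] \} \\
&= \sum_{\tau=1}^{T} \{ v(\rho) - \mathbb{E}[\tilde{v}(b_\tau/\tau)] \} + \sum_{\tau=1}^{T} \mathbb{E}[\tilde{v}(b_\tau/\tau) - v(b_\tau/\tau)] \\
&\le (u_1^* - u_J^*) \sum_{\tau=1}^{T} \left[ \sqrt{\frac{(T-\tau)\rho(1-\rho)}{(T-1)\tau}} + e^{-2(\delta')^2 \tau} \right] \\
&\le (u_1^* - u_J^*) \sqrt{\rho(1-\rho)} \sum_{\tau=1}^{T} \sqrt{\frac{1}{\tau}} + \frac{u_1^* - u_J^*}{1 - e^{-2(\delta')^2}} \\
&\le 2 (u_1^* - u_J^*) \sqrt{\rho(1-\rho)}\, \sqrt{T} + \frac{u_1^* - u_J^*}{1 - e^{-2(\delta')^2}}.
\end{aligned}$$

Here the second inequality uses $\mathbb{E}[|b_\tau/\tau - \rho|] \le \sqrt{\mathrm{Var}(b_\tau)}/\tau$ together with the variance of the hypergeometric distribution from Lemma 2. 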


C Proof of Theorem 2: Regret of UCB-ALP 


We bound the regret of UCB-ALP by comparing its performance with the benchmark $\widehat{U}(T, B)$. To obtain this upper bound, we first partition the regret according to its sources and then bound each part of the regret separately. 

Before presenting the proof, we first introduce a notation that will be widely used later. For contexts $j$ and $j'$, and an action $k$, let $\Delta^{(j')}_{j,k}$ be the difference between the expected reward for action $k$ under context $j$ and the highest expected reward under context $j'$, i.e., $\Delta^{(j')}_{j,k} = u^*_{j'} - u_{j,k}$. When $j' = j$, $\Delta^{(j)}_{j,k}$ is the difference of expected reward between the suboptimal action $k$ and the best action under context $j$. 
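C.1 Step 1: Partition the Regret 

Note that the total reward of the oracle solution satisfies $U^*(T,B) \le \widehat{U}(T,B)$. Thus, we can bound the regret of UCB-ALP by comparing its total expected reward $U_{\mathrm{UCB\text{-}ALP}}(T, B)$ with $\widehat{U}(T, B)$, i.e., 

$$\begin{aligned}
R_{\mathrm{UCB\text{-}ALP}}(T, B)
&= U^*(T,B) - U_{\mathrm{UCB\text{-}ALP}}(T,B) \\
&\le \widehat{U}(T,B) - U_{\mathrm{UCB\text{-}ALP}}(T,B) \\
&= T v(\rho) - \sum_{j=1}^{J} \sum_{k=1}^{K} u_{j,k}\, \mathbb{E}[C_{j,k}(T)].
\end{aligned} \tag{15}$$

The total expected reward of UCB-ALP can be further divided as 

$$\begin{aligned}
U_{\mathrm{UCB\text{-}ALP}}(T, B)
&= \sum_{j=1}^{J} \sum_{k=1}^{K} u_{j,k}\, \mathbb{E}[C_{j,k}(T)] = \sum_{j=1}^{J} \sum_{k=1}^{K} \big[u_j^* - \Delta^{(j)}_{j,k}\big]\, \mathbb{E}[C_{j,k}(T)] \\
&= \sum_{j=1}^{J} u_j^*\, \mathbb{E}[C_j(T)] - \sum_{j=1}^{J} \sum_{k=1}^{K} \Delta^{(j)}_{j,k}\, \mathbb{E}[C_{j,k}(T)],
\end{aligned}$$

where $C_j(T) = \sum_{k=1}^{K} C_{j,k}(T)$ is the total number of times that actions have been taken under context $j$ up to round $T$. 

Consequently, the regret of UCB-ALP can be bounded as 

$$R_{\mathrm{UCB\text{-}ALP}}(T, B) \le R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B) + R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B), \tag{16}$$

where 

$$R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B) = \sum_{j=1}^{J} \sum_{k=1}^{K} \Delta^{(j)}_{j,k}\, \mathbb{E}[C_{j,k}(T)],$$

and 

$$R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B) = \sum_{\tau=1}^{T} \mathbb{E}\Big[ v(\rho) - \sum_{j=1}^{J} \hat{p}_j(b_\tau/\tau)\, \pi_j u_j^* \Big],$$

with $\hat{p}_j(\cdot)$ denoting the probabilities computed by UCB-ALP. 

Eq. (16) shows that the regret of the UCB-ALP algorithm can be divided into two parts: the first part, $R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B)$, comes from taking suboptimal actions under a given context; the second part, $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B)$, comes from the deviation of the remaining budget $b_\tau$ and from context ranking errors. 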


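C.2 Step 2: Bound Each Part of Regret 
C.2.1 Step 2.1: Bound of $R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B)$ 


For the regret from action ranking errors, we show in Lemma 4 that $R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B) = O(\log T)$, using techniques similar to those for traditional UCB methods. 

Lemma 4. Under UCB-ALP, the regret due to the action ranking errors within context $j$ satisfies 

$$R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B) \le \sum_{j=1}^{J} \sum_{k \ne k_j^*} \left[ \left( \frac{2}{\Delta^{(j)}_{j,k}} + 2\Delta^{(j)}_{j,k} \right) \log T + 2\Delta^{(j)}_{j,k} \right]. \tag{17}$$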
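Proof. For $k \ne k_j^*$, let $\ell_{j,k} = 2\log T / (\Delta^{(j)}_{j,k})^2$. According to Lemma 3, we have 

$$\begin{aligned}
\mathbb{E}[C_{j,k}(T)]
&\le \ell_{j,k} + \sum_{t=1}^{T} \mathbb{P}\{X_t = j,\ A_t = k,\ C_{j,k}(t-1) \ge \ell_{j,k}\} \\
&\le \ell_{j,k} + \sum_{t=1}^{T} \mathbb{P}\{\hat{u}_{j,k}(t) \ge \hat{u}_{j,k_j^*}(t) \mid C_{j,k}(t-1) \ge \ell_{j,k}\} \\
&\le \ell_{j,k} + \sum_{t=1}^{T} 2t^{-1}.
\end{aligned}$$

The conclusion then follows from the facts that $\sum_{t=1}^{T} \frac{1}{t} \le 1 + \log T$ and $R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B) = \sum_{j=1}^{J} \sum_{k \ne k_j^*} \Delta^{(j)}_{j,k}\, \mathbb{E}[C_{j,k}(T)]$. 

□ 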
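
For a feel of the magnitude of the bound in Eq. (17), a minimal sketch is given below. The function name `action_ranking_regret_bound` and the gap values passed to it are hypothetical; the function simply evaluates the right-hand side of Eq. (17) for a list of positive gaps $\Delta^{(j)}_{j,k}$.

```python
from math import log


def action_ranking_regret_bound(gaps, T):
    """Right-hand side of Eq. (17) for the given positive gaps Delta_{j,k}^{(j)}."""
    return sum((2 / d + 2 * d) * log(T) + 2 * d for d in gaps)


def harmonic_sum(T):
    """Partial harmonic sum, which the proof bounds by 1 + log T."""
    return sum(1 / t for t in range(1, T + 1))


# One suboptimal action with gap 0.5 over a horizon of 100 rounds.
example_bound = action_ranking_regret_bound([0.5], 100)
```

Small gaps dominate the bound through the $2/\Delta$ term, which is the usual behavior of UCB-type analyses.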


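C.2.2 Step 2.2: Bound of $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B)$ 

Next, we show that the second part satisfies $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B) = O(\log T)$. We first present the proof for the non-boundary cases, and discuss the boundary cases later. 

Note that we have separately considered the regret due to action ranking errors in $R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B)$, so we only need to consider the best action of each context for $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B)$. Thus, we define $\hat{v}_{\mathrm{UCB\text{-}ALP}}(\tau, b_\tau)$ as follows: 

$$\hat{v}_{\mathrm{UCB\text{-}ALP}}(\tau, b_\tau) = \sum_{j=1}^{J} \hat{p}_j(b_\tau/\tau)\, \pi_j u_j^*.$$

Let $\Delta v_\tau$ be the single-round difference between UCB-ALP and the upper bound, i.e., 

$$\Delta v_\tau = v(\rho) - \hat{v}_{\mathrm{UCB\text{-}ALP}}(\tau, b_\tau).$$

Then $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B) = \sum_{\tau=1}^{T} \mathbb{E}[\Delta v_\tau]$. We study the expectation $\mathbb{E}[\Delta v_\tau]$ under all possible situations. For a random variable $X$ and an event $\mathcal{E}$, let $\mathbb{E}[X, \mathcal{E}] = \mathbb{E}[X\, 1(\mathcal{E})]$. Then $\mathbb{E}[X] = \mathbb{E}[X, \mathcal{E}] + \mathbb{E}[X, \neg\mathcal{E}]$. Therefore, 

$$\mathbb{E}[\Delta v_\tau] = \sum_{s=0}^{2} \mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)]. \tag{18}$$

We first consider the case $s = 0$ and convert the expectation into the other two cases. Considering all possible values of $b_\tau$, we have 

$$\mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)] = \sum_{b=0}^{B} \mathbb{E}[\Delta v_\tau \mid b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)]\ \mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)\}. \tag{19}$$

For the probability, we have 

$$\mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)\} = \mathbb{P}\{b_\tau = b\} - \sum_{s=1}^{2} \mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\}. \tag{20}$$

For the conditional expectation, we note that $\mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)$ provides a roughly correct context rank, in the sense that if $b_\tau/\tau$ is close to $\rho$, then $\hat{v}_{\mathrm{UCB\text{-}ALP}}(\tau, b_\tau) = v(b_\tau/\tau)$, where $v(b_\tau/\tau)$ is the single-round value with the correct context rank. Specifically, let $\delta = \frac{1}{2} \min\{\rho - q_{\tilde{j}(\rho)},\ q_{\tilde{j}(\rho)+1} - \rho\}$. If $b/\tau \in [\rho - \delta, \rho + \delta]$, then $\hat{v}_{\mathrm{UCB\text{-}ALP}}(\tau, b) = v(b/\tau)$, and thus, 

$$\mathbb{E}[\Delta v_\tau \mid b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)] = v(\rho) - v(b/\tau). \tag{21}$$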
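Combining Eqs. (19)-(21) and using the facts that $v(\rho) \ge 0$ and $\hat{v}_{\mathrm{UCB\text{-}ALP}}(\tau, b) \ge 0$, we have 

$$\begin{aligned}
\mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)]
&\le v(\rho) - \sum_{b=0}^{B} v(b/\tau)\, \mathbb{P}\{b_\tau = b\} \\
&\quad + \sum_{s=1}^{2} \sum_{b=0}^{B} v(b/\tau)\, \mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\} \\
&\quad + \sum_{b/\tau \notin [\rho-\delta,\rho+\delta]} v(b/\tau)\, \mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)\}.
\end{aligned} \tag{22}$$

Recall that under UCB-ALP, the remaining budget $b_\tau$ follows the hypergeometric distribution. Using the same method as in the analysis of Eq. (12), we have 

$$v(\rho) - \sum_{b=0}^{B} v(b/\tau)\, \mathbb{P}\{b_\tau = b\} \le (u_1^* - u_J^*)\, e^{-2\delta^2\tau}. \tag{23}$$

In addition, 

$$\sum_{b/\tau \notin [\rho-\delta,\rho+\delta]} v(b/\tau)\, \mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)\} \le \bar{u}^* \sum_{b/\tau \notin [\rho-\delta,\rho+\delta]} \mathbb{P}\{b_\tau = b\} \le 2\bar{u}^* e^{-2\delta^2\tau}, \tag{24}$$

where $\bar{u}^* = \sum_{j=1}^{J} \pi_j u_j^*$ is the expected reward without budget constraint. 

Moreover, 

$$\sum_{b=0}^{B} v(b/\tau)\, \mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\} \le \bar{u}^*\, \mathbb{P}\{\mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\}. \tag{25}$$

Substituting Eqs. (23)-(25) into Eq. (22), we have 

$$\mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)] \le [u_1^* - u_J^* + 2\bar{u}^*]\, e^{-2\delta^2\tau} + \bar{u}^* \sum_{s=1}^{2} \mathbb{P}\{\mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\}. \tag{26}$$

When the rank is wrong, i.e., $1 \le s \le 2$, since $\Delta v_\tau \le v(\rho)$ under any possible ranking results, we have 

$$\mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)] \le v(\rho)\, \mathbb{P}\{\mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\}. \tag{27}$$

Substituting Eqs. (26) and (27) into Eq. (18), we have 

$$\mathbb{E}[\Delta v_\tau] \le [u_1^* - u_J^* + 2\bar{u}^*]\, e^{-2\delta^2\tau} + [\bar{u}^* + v(\rho)] \sum_{s=1}^{2} \mathbb{P}\{\mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\}.$$

Note that $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B) = \sum_{\tau=1}^{T} \mathbb{E}[\Delta v_\tau]$. Thus, 

$$R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B) \le \frac{[u_1^* - u_J^* + 2\bar{u}^*]\, e^{-2\delta^2}}{1 - e^{-2\delta^2}} + [\bar{u}^* + v(\rho)] \sum_{s=1}^{2} \mathbb{E}[T^{(s)}], \tag{28}$$

where $T^{(s)} = \sum_{t=1}^{T} 1(\mathcal{E}_{\mathrm{rank},s}(t))$ ($s = 1, 2$) is the number of type-$s$ ranking errors. 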

Next, we bound the expected number of context ranking errors. From Lemma 3, we know that to obtain the correct ordering of two context-action pairs with high probability, the agent needs to execute the suboptimal context-action pair sufficiently many times. Unlike traditional MABs, however, the context-action pair with the higher UCB in a round might not be executable, as the context of that round could be different. Fortunately, the following lemma shows that if the condition that causes a context-action pair to be executed with positive probability appears many times, then the context-action pair will indeed be executed a proportional number of times with high probability. 
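Lemma 5. Assume $\mathcal{E}_t$'s and $\tilde{\mathcal{E}}_t$'s are events in round $t$ ($1 \le t \le T$), satisfying $\mathbb{P}\{\mathcal{E}_t \mid \tilde{\mathcal{E}}_t, \mathcal{H}_{t-1}\} = \mathbb{P}\{\mathcal{E}_t \mid \tilde{\mathcal{E}}_t\} \ge p > 0$, where $\mathcal{H}_{t-1}$ is the filtration from 1 to $t - 1$. Let $C(T) = \sum_{t=1}^{T} 1(\mathcal{E}_t)$ and $\tilde{C}(T) = \sum_{t=1}^{T} 1(\tilde{\mathcal{E}}_t)$. Then, 

$$\mathbb{P}\{C(T) \le (p - \epsilon)N,\ \tilde{C}(T) \ge N\} \le e^{-2\epsilon^2 N}.$$

Proof. One may think the proof of this lemma is trivial because $\mathbb{P}\{C(T) \le (p - \epsilon)N,\ \tilde{C}(T) \ge N\} \le \mathbb{P}\{C(T) \le (p - \epsilon)N \mid \tilde{C}(T) \ge N\}$ and we can bound the right-hand side using the Chernoff bound. However, this is incorrect because although $\mathcal{E}_t$ is independent of the history given $\tilde{\mathcal{E}}_t$, the event $\tilde{\mathcal{E}}_{t+1}$ may depend on $\mathcal{E}_t$. 

We prove this lemma using a coupling argument. Let $S_t = 1(\mathcal{E}_t \cap \tilde{\mathcal{E}}_t)$, and $C_S(T) = \sum_{t=1}^{T} S_t$. Since $C_S(T) \le C(T)$, we have 

$$\mathbb{P}\{C(T) \le (p - \epsilon)N,\ \tilde{C}(T) \ge N\} \le \mathbb{P}\{C_S(T) \le (p - \epsilon)N,\ \tilde{C}(T) \ge N\}. \tag{29}$$

Now, we show $\mathbb{P}\{C_S(T) \le (p - \epsilon)N,\ \tilde{C}(T) \ge N\} \le e^{-2\epsilon^2 N}$ using the coupling argument. 

First, generate $W_1, W_2, \ldots, W_T$ i.i.d. according to the Bernoulli distribution with $\mathbb{P}\{W_t = 1\} = p$. Next, generate a sequence $(V'_t, S'_t,\ 1 \le t \le T)$ as follows: 

For each $t$, generate $V'_t$ according to the Bernoulli distribution with $\mathbb{P}\{V'_t = 1\} = \mathbb{P}\{1(\tilde{\mathcal{E}}_t) = 1 \mid 1(\tilde{\mathcal{E}}_{t'}) = V'_{t'},\ S_{t'} = S'_{t'},\ 1 \le t' \le t - 1\}$. Further, we generate $S'_t$ conditioned on the values of $V'_t$ and the $W_t$'s. Specifically, let $C_{V'}(t) = \sum_{t'=1}^{t} V'_{t'}$. 

1) If $V'_t = 1$, generate $S'_t$ conditioned on $W_{C_{V'}(t)}$: 

a. If $W_{C_{V'}(t)} = 1$, then $S'_t = 1$; 

b. If $W_{C_{V'}(t)} = 0$, then generate $S'_t$ according to the Bernoulli distribution with 

$$\mathbb{P}\{S'_t = 1 \mid W_{C_{V'}(t)} = 0\} = \frac{\mathbb{P}[1(\mathcal{E}_t) = 1 \mid 1(\tilde{\mathcal{E}}_t) = 1] - p}{1 - p}.$$

2) If $V'_t = 0$, let $S'_t = 0$. 

We can verify that $(V'_t, S'_t,\ 1 \le t \le T)$ has the same distribution as $(1(\tilde{\mathcal{E}}_t), S_t,\ 1 \le t \le T)$. Hence, $\mathbb{P}\{C_S(T) \le (p - \epsilon)N,\ \tilde{C}(T) \ge N\} = \mathbb{P}\{\sum_{t=1}^{T} S'_t \le (p - \epsilon)N,\ \sum_{t=1}^{T} V'_t \ge N\}$. 

On the other hand, from the generation of $S'_t$, we have $\sum_{t=1}^{T} S'_t \ge \sum_{n=1}^{C_{V'}(T)} W_n$. Thus, the event $\{\sum_{t=1}^{T} S'_t \le (p - \epsilon)N,\ \sum_{t=1}^{T} V'_t \ge N\}$ implies $\{\sum_{n=1}^{N} W_n \le (p - \epsilon)N\}$, and 

$$\mathbb{P}\Big\{\sum_{t=1}^{T} S'_t \le (p - \epsilon)N,\ \sum_{t=1}^{T} V'_t \ge N\Big\} \le \mathbb{P}\Big\{\sum_{n=1}^{N} W_n \le (p - \epsilon)N\Big\} \le e^{-2\epsilon^2 N}. \tag{30}$$

The conclusion of the lemma then follows. 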


□ 
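
The last step of the coupling argument reduces the problem to a lower-tail bound for a sum of $N$ i.i.d. Bernoulli($p$) variables. The sketch below compares the exact binomial lower tail with the standard Hoeffding bound $e^{-2\epsilon^2 N}$; treating this as the bound intended in the lemma is our assumption, and the function names and parameter values are hypothetical.

```python
from math import comb, exp


def binomial_lower_tail(N, p, k_max):
    """Exact P{Binomial(N, p) <= k_max} for an integer threshold k_max."""
    return sum(comb(N, n) * p ** n * (1 - p) ** (N - n) for n in range(0, k_max + 1))


def hoeffding_bound(N, eps):
    """Hoeffding bound on P{sum of N Bernoulli(p) <= (p - eps) * N}."""
    return exp(-2 * eps ** 2 * N)


# N = 50, p = 0.3, eps = 0.1, so the threshold (p - eps) * N equals 10.
exact_tail = binomial_lower_tail(50, 0.3, 10)
bound = hoeffding_bound(50, 0.1)
```

The threshold is passed as an integer to avoid floating-point rounding of $(p - \epsilon)N$.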


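The following lemma bounds the expected number of context ranking errors. 

Lemma 6. Given $\pi_j$'s, $u_{j,k}$'s and a fixed $\rho \in (0,1)$, $\rho \ne q_j$ ($1 \le j \le J - 1$), under the UCB-ALP algorithm, we have 

$$\mathbb{E}[T^{(1)}] \le \sum_{j=1}^{\tilde{j}(\rho)} \sum_{k=1}^{K} \frac{27 \log T}{2 g_{\tilde{j}(\rho)+1} \big[\Delta^{(j)}_{\tilde{j}(\rho)+1,k}\big]^2} + 2K\tilde{j}(\rho) \log T + O(1),$$

$$\mathbb{E}[T^{(2)}] \le \sum_{j=\tilde{j}(\rho)+2}^{J} \sum_{k=1}^{K} \frac{27 \log T}{2 g_j \big[\Delta^{(\tilde{j}(\rho)+1)}_{j,k}\big]^2} + 2K[J - \tilde{j}(\rho) - 1] \log T + O(1),$$

where $g_j = \min\{\pi_j,\ \frac{1}{2}(\rho - q_{\tilde{j}(\rho)}),\ \frac{1}{2}(q_{\tilde{j}(\rho)+1} - \rho)\}$. 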
Proof. We only prove the conclusion for the case $s = 1$, as the other case can be analyzed similarly. From Algorithm 1, we can see that the evolution of the remaining budget also affects the execution of the UCB-ALP algorithm. Under the assumption of known context distribution, it can be verified that Lemma 2 holds under UCB-ALP, i.e., the remaining budget $b_\tau$ follows the hypergeometric distribution and has the properties described in Lemma 2. We define an event $\mathcal{E}_{\mathrm{budget},0}(t)$ as follows: 

$$\mathcal{E}_{\mathrm{budget},0}(t) = \{(\rho - \delta)\tau \le b_\tau \le (\rho + \delta)\tau\},$$

where $\delta$ is given by 

$$\delta = \frac{1}{2} \min\{\rho - q_{\tilde{j}(\rho)},\ q_{\tilde{j}(\rho)+1} - \rho\}.$$

According to Lemma 2, we have 

$$\mathbb{P}\{\neg\mathcal{E}_{\mathrm{budget},0}(t)\} = \mathbb{P}\{b_\tau < (\rho - \delta)\tau\} + \mathbb{P}\{b_\tau > (\rho + \delta)\tau\} \le 2e^{-2\delta^2\tau}.$$


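Back to the ranking event $\mathcal{E}_{\mathrm{rank},1}(t)$, we have 

$$\mathbb{P}(\mathcal{E}_{\mathrm{rank},1}(t)) \le \mathbb{P}(\neg\mathcal{E}_{\mathrm{budget},0}(t)) + \mathbb{P}(\mathcal{E}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t)).$$

Note that the event $\mathcal{E}_{\mathrm{rank},1}(t)$ can be divided as follows: 

$$\mathcal{E}_{\mathrm{rank},1}(t) = \bigcup_{1 \le j \le \tilde{j}(\rho),\ 1 \le k \le K} \mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),$$

where for $1 \le j \le \tilde{j}(\rho)$ and $1 \le k \le K$, 

$$\mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t) = \{\hat{u}_j^*(t) \le \hat{u}^*_{\tilde{j}(\rho)+1}(t),\ k^*_{\tilde{j}(\rho)+1}(t) = k\}.$$

Thus, 

$$\mathbb{E}[T^{(1)}] = \sum_{t=1}^{T} \mathbb{P}(\mathcal{E}_{\mathrm{rank},1}(t)) \le \frac{2e^{-2\delta^2}}{1 - e^{-2\delta^2}} + \sum_{j=1}^{\tilde{j}(\rho)} \sum_{k=1}^{K} \mathbb{E}[N^{(j,k)}(T)], \tag{31}$$

where for $1 \le j \le \tilde{j}(\rho)$ and $1 \le k \le K$, 

$$N^{(j,k)}(T) = \sum_{t=1}^{T} 1\big(\mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t)\big).$$

Let $\ell^{(j)}_{\tilde{j}(\rho)+1,k} = \dfrac{2\log T}{g_{\tilde{j}(\rho)+1}(1-\epsilon)\epsilon^2 \big(\Delta^{(j)}_{\tilde{j}(\rho)+1,k}\big)^2}$, where $g_{\tilde{j}(\rho)+1} = \min\{\pi_{\tilde{j}(\rho)+1}, \delta\}$ and $\epsilon \in (0,1)$. Similar to the analysis of UCB in [10], we have 

$$\mathbb{E}[N^{(j,k)}(T)] \le \ell^{(j)}_{\tilde{j}(\rho)+1,k} + \sum_{t=1}^{T} \mathbb{P}\big\{\mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t),\ N^{(j,k)}(t-1) \ge \ell^{(j)}_{\tilde{j}(\rho)+1,k}\big\}. \tag{32}$$

For each $t$ in the second term, we have 

$$\begin{aligned}
&\mathbb{P}\big\{\mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t),\ N^{(j,k)}(t-1) \ge \ell^{(j)}_{\tilde{j}(\rho)+1,k}\big\} \\
&\quad \le \mathbb{P}\big\{\mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t) \mid C_{\tilde{j}(\rho)+1,k}(t-1) \ge g_{\tilde{j}(\rho)+1}(1-\epsilon)\,\ell^{(j)}_{\tilde{j}(\rho)+1,k}\big\} \\
&\qquad + \mathbb{P}\big\{C_{\tilde{j}(\rho)+1,k}(t-1) < g_{\tilde{j}(\rho)+1}(1-\epsilon)\,\ell^{(j)}_{\tilde{j}(\rho)+1,k},\ N^{(j,k)}(t-1) \ge \ell^{(j)}_{\tilde{j}(\rho)+1,k}\big\},
\end{aligned}$$

where $C_{\tilde{j}(\rho)+1,k}(t) = \sum_{t'=1}^{t} 1(X_{t'} = \tilde{j}(\rho)+1,\ A_{t'} = k)$ is the number of times that the context-action pair $(\tilde{j}(\rho)+1, k)$ has been executed up to round $t$. 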
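For the first term, we note that the event $\{\mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t)\}$ implies that $\hat{u}_{j,k_j^*}(t) \le \hat{u}_{\tilde{j}(\rho)+1,k}(t)$. Because $u_{j,k_j^*} > u_{\tilde{j}(\rho)+1,k}$ for all $j \le \tilde{j}(\rho)$ and $k$, according to Lemma 3, we have 

$$\begin{aligned}
&\mathbb{P}\big\{\mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t) \mid C_{\tilde{j}(\rho)+1,k}(t-1) \ge g_{\tilde{j}(\rho)+1}(1-\epsilon)\,\ell^{(j)}_{\tilde{j}(\rho)+1,k}\big\} \\
&\quad \le \mathbb{P}\big\{\hat{u}_{j,k_j^*}(t) \le \hat{u}_{\tilde{j}(\rho)+1,k}(t) \mid C_{\tilde{j}(\rho)+1,k}(t-1) \ge g_{\tilde{j}(\rho)+1}(1-\epsilon)\,\ell^{(j)}_{\tilde{j}(\rho)+1,k}\big\} \\
&\quad \le 2t^{-1}.
\end{aligned} \tag{33}$$

For the second term, we note that since context $\tilde{j}(\rho)+1$ arrives with probability $\pi_{\tilde{j}(\rho)+1}$ independent of the observations, we have 

$$\mathbb{P}\big\{X_t = \tilde{j}(\rho)+1,\ A_t = k \mid \mathcal{E}^{(j,k)}_{\mathrm{rank},1}(t),\ \mathcal{E}_{\mathrm{budget},0}(t)\big\} \ge \min\{\delta,\ \pi_{\tilde{j}(\rho)+1}\} = g_{\tilde{j}(\rho)+1}.$$

Thus, according to Lemma 5, we have 

$$\mathbb{P}\big\{C_{\tilde{j}(\rho)+1,k}(t-1) < g_{\tilde{j}(\rho)+1}(1-\epsilon)\,\ell^{(j)}_{\tilde{j}(\rho)+1,k},\ N^{(j,k)}(t-1) \ge \ell^{(j)}_{\tilde{j}(\rho)+1,k}\big\} \le T^{-4}. \tag{34}$$

Substituting Eqs. (33) and (34) into Eq. (32), we have 

$$\mathbb{E}[N^{(j,k)}(T)] \le \ell^{(j)}_{\tilde{j}(\rho)+1,k} + \sum_{t=1}^{T} (2t^{-1} + T^{-4}) \le \ell^{(j)}_{\tilde{j}(\rho)+1,k} + 2\log T + 2 + T^{-3}. \tag{35}$$

Substituting Eq. (35) into Eq. (31) and letting $\epsilon = 2/3$ in $\ell^{(j)}_{\tilde{j}(\rho)+1,k}$, we have 

$$\begin{aligned}
\mathbb{E}[T^{(1)}]
&\le \frac{2e^{-2\delta^2}}{1 - e^{-2\delta^2}} + \sum_{j=1}^{\tilde{j}(\rho)} \sum_{k=1}^{K} \ell^{(j)}_{\tilde{j}(\rho)+1,k} + 2K\tilde{j}(\rho)\log T + O(1) \\
&= \sum_{j=1}^{\tilde{j}(\rho)} \sum_{k=1}^{K} \frac{27\log T}{2 g_{\tilde{j}(\rho)+1} \big[\Delta^{(j)}_{\tilde{j}(\rho)+1,k}\big]^2} + 2K\tilde{j}(\rho)\log T + O(1).
\end{aligned} \tag{36}$$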


□ 


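Combining Lemma 4, Lemma 6, and Eq. (28), we have 

$$\limsup_{T \to \infty} \frac{R_{\mathrm{UCB\text{-}ALP}}(T, B)}{\log T} \le \Theta^{(a)} + \Theta^{(c)}_{\mathrm{nb}},$$

where 

$$\Theta^{(a)} = \sum_{j=1}^{J} \sum_{k \ne k_j^*} \left( \frac{2}{\Delta^{(j)}_{j,k}} + 2\Delta^{(j)}_{j,k} \right),$$

$$\Theta^{(c)}_{\mathrm{nb}} = [\bar{u}^* + v(\rho)] \left[ \sum_{j=1}^{\tilde{j}(\rho)} \sum_{k=1}^{K} \frac{27}{2 g_{\tilde{j}(\rho)+1} \big[\Delta^{(j)}_{\tilde{j}(\rho)+1,k}\big]^2} + \sum_{j=\tilde{j}(\rho)+2}^{J} \sum_{k=1}^{K} \frac{27}{2 g_j \big[\Delta^{(\tilde{j}(\rho)+1)}_{j,k}\big]^2} + 2KJ \right].$$

This completes the proof of Part 1 in Theorem 2. 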


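Next, we discuss the bound of $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B)$ in the boundary cases. The analysis is similar to that of the non-boundary cases, with a slight modification of the threshold. 

We note that, fundamentally, the context $\tilde{j}(\rho) + 1$ for $\rho \ne q_j$ and the context $\tilde{j}(\rho)$ for $\rho = q_j$ are both the minimum context with positive probability in the static LP problem. Thus, we can define the context ranking events $\mathcal{E}_{\mathrm{rank},s}(t)$ ($0 \le s \le 2$) similarly to the analysis of Part 1, with $\tilde{j}(\rho) + 1$ replaced by $\tilde{j}(\rho)$. Then, we have 

$$R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B) = \sum_{\tau=1}^{T} \mathbb{E}[\Delta v_\tau],$$

where 

$$\mathbb{E}[\Delta v_\tau] = \sum_{s=0}^{2} \mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)].$$

For the case $s = 0$, 

$$\mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)] = \sum_{b=0}^{B} \mathbb{E}[\Delta v_\tau \mid b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)]\ \mathbb{P}\{b_\tau = b,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)\}.$$

When $b/\tau \in [\rho - \delta, \rho + \delta]$ and $\mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)$ occurs, we have $\Delta v_\tau \le (u_1^* - u_J^*)\,|\rho - b/\tau|$. Moreover, $\Delta v_\tau \le v(\rho)$ under any condition. Thus, 

$$\begin{aligned}
\mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},0}(T - \tau + 1)]
&\le (u_1^* - u_J^*)\, \mathbb{E}[|b_\tau/\tau - \rho|] + v(\rho) \sum_{b/\tau \notin [\rho-\delta,\rho+\delta]} \mathbb{P}\{b_\tau = b\} \\
&\le (u_1^* - u_J^*)\, \frac{\sqrt{\mathrm{Var}(b_\tau)}}{\tau} + 2v(\rho)\, e^{-2\delta^2\tau}.
\end{aligned}$$

For the other cases $s = 1, 2$, we have 

$$\mathbb{E}[\Delta v_\tau,\ \mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)] \le v(\rho)\, \mathbb{P}\{\mathcal{E}_{\mathrm{rank},s}(T - \tau + 1)\}.$$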


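On the other hand, we extend Lemma 6 to the boundary cases: 

$$\limsup_{T \to \infty} \frac{\mathbb{E}[T^{(1)}]}{\log T} \le \sum_{j < \tilde{j}(\rho)} \sum_{k=1}^{K} \frac{27}{2 g_{\tilde{j}(\rho)} \big[\Delta^{(j)}_{\tilde{j}(\rho),k}\big]^2} + 2K[\tilde{j}(\rho) - 1],$$

$$\limsup_{T \to \infty} \frac{\mathbb{E}[T^{(2)}]}{\log T} \le \sum_{j \ge \tilde{j}(\rho)+1} \sum_{k=1}^{K} \frac{27}{2 g_j \big[\Delta^{(\tilde{j}(\rho))}_{j,k}\big]^2} + 2K[J - \tilde{j}(\rho)],$$

where 

$$g_j = \min\{\pi_j,\ \tfrac{1}{2}(\rho - q_{\tilde{j}(\rho)-1}),\ \tfrac{1}{2}(q_{\tilde{j}(\rho)+1} - \rho)\}.$$

Consequently, we can bound $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B)$ by summing over the entire horizon and using the properties of $T^{(1)}$ and $T^{(2)}$. The conclusion of Part 2 of Theorem 2 then follows by adding the bounds on $R^{(a)}_{\mathrm{UCB\text{-}ALP}}(T, B)$ and $R^{(c)}_{\mathrm{UCB\text{-}ALP}}(T, B)$. 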


D Two-Context Systems with Unit-Cost 

As a special case, the oracle algorithm can be obtained for two-context systems with unit costs. 
When the context distribution and expected rewards are unknown, the oracle algorithm can be com¬ 
bined with the UCB method to achieve logarithmic regret under both boundary and non-boundary 
cases. 


D.1 Oracle Algorithm: Procrastinate-for-the-Better-context 

When there are only two contexts, the oracle algorithm is trivial. Under the unit-cost assumption, skipping the worse context does not waste any opportunities if $b_\tau < \tau$. Thus, the agent can reserve budget for the better context, unless there is sufficient budget; i.e., we have the following algorithm: 

Procrastinate-for-the-Better (PB): If $X_t = 1$ and $b_\tau > 0$, or if $b_\tau \ge \tau$, take action $A_t = k^*_{X_t}$; otherwise, $A_t = 0$. 

We can verify that the above PB algorithm achieves the highest expected reward for any realization 
of the context arrival process. Thus, the PB algorithm is optimal in two-context systems. We note 
that the PB algorithm does not need to know the context distribution and only requires the ordering of 
the expected rewards. This property allows us to extend it to the case where the context distribution 
or expected rewards are unknown. 
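
The PB rule can be written down in a few lines. The sketch below is a minimal illustration that tracks only whether an action is taken in each round; the function name and the encoding of contexts as the integers 1 (better) and 2 (worse) are our assumptions.

```python
def procrastinate_for_the_better(contexts, budget):
    """Apply the PB rule to a sequence of context arrivals in {1, 2}.

    Returns a list of booleans: True if a (non-dummy) action is taken in that round.
    """
    T = len(contexts)
    b = budget
    decisions = []
    for t, x in enumerate(contexts, start=1):
        tau = T - t + 1  # remaining time, including the current round
        take = (x == 1 and b > 0) or b >= tau
        decisions.append(take)
        if take:
            b -= 1  # unit cost
    return decisions


# Budget 2 over 5 rounds: wait for the better context when the budget is scarce.
example = procrastinate_for_the_better([2, 1, 2, 2, 1], 2)
```

In the first example the two units are spent on the two arrivals of context 1; if only the worse context arrives, the budget is spent in the last rounds, when $b_\tau \ge \tau$.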
D.2 UCB-PB: Logarithmic Regret Algorithm for Two-Context Bandits with Unit-Cost 


When the context distribution and expected rewards are unknown, we propose the UCB-based 
Procrastinate-for-the-Better (UCB-PB) algorithm for solving the constrained contextual bandit prob¬ 
lem with two contexts and unit costs. 


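Algorithm 2 UCB-PB 

Input: Time horizon $T$, budget $B$; 
Init: Remaining time $\tau = T$, remaining budget $b = B$; 
    $C_{j,k}(0) = 0$, $\bar{u}_{j,k}(0) = 0$, $\hat{u}_{j,k}(0) = 1$, for all $j \in \mathcal{X}$ and $k \in \mathcal{A}$; $\hat{u}_j^*(0) = 1$ for all $j \in \mathcal{X}$; 
for $t = 1$ to $T$ do 
    $k_j^*(t) \leftarrow \arg\max_k \hat{u}_{j,k}(t)$, $\forall j$; 
    $\hat{u}_j^*(t) \leftarrow \hat{u}_{j,k_j^*(t)}(t)$, $\forall j$; 
    $j^*(t) \leftarrow \arg\max_j \hat{u}_j^*(t)$; 
    if $b \ge \tau$ or ($0 < b < \tau$ and $X_t = j^*(t)$) then 
        Take action $k^*_{X_t}(t)$; 
    end if 
    Update $\tau$, $b$, $C_{j,k}(t)$, $\bar{u}_{j,k}(t)$, and $\hat{u}_{j,k}(t)$; 
end for 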


As shown in Algorithm 2, the agent maintains UCB estimates $\hat{u}_{j,k}(t)$'s for the expected rewards of all context-action pairs. In each round, the agent implements the PB algorithm based on these estimates. 
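
A minimal simulation sketch of UCB-PB follows. Several details are assumptions made only for this illustration and are not confirmed by the text above: rewards are Bernoulli, the index is the empirical mean plus $\sqrt{\log t / (2C_{j,k})}$, unexplored pairs get index 1, and the function name `ucb_pb` and all parameter values are hypothetical.

```python
import math
import random


def ucb_pb(T, B, pi, means, seed=0):
    """Simulate UCB-PB with Bernoulli rewards; returns (total reward, budget spent)."""
    rng = random.Random(seed)
    J, K = len(means), len(means[0])
    counts = [[0] * K for _ in range(J)]
    sums = [[0.0] * K for _ in range(J)]
    b, total_reward, spent = B, 0.0, 0
    for t in range(1, T + 1):
        tau = T - t + 1
        x = rng.choices(range(J), weights=pi)[0]  # context arrival

        def ucb(j, k):
            # Assumed index: empirical mean plus a confidence radius.
            if counts[j][k] == 0:
                return 1.0
            return sums[j][k] / counts[j][k] + math.sqrt(math.log(t) / (2 * counts[j][k]))

        best_action = [max(range(K), key=lambda k, j=j: ucb(j, k)) for j in range(J)]
        best_context = max(range(J), key=lambda j: ucb(j, best_action[j]))
        if b >= tau or (0 < b < tau and x == best_context):
            k = best_action[x]
            reward = 1.0 if rng.random() < means[x][k] else 0.0
            counts[x][k] += 1
            sums[x][k] += reward
            total_reward += reward
            b -= 1
            spent += 1
    return total_reward, spent


reward, spent = ucb_pb(200, 60, [0.5, 0.5], [[0.9, 0.2], [0.4, 0.1]])
```

Because an action is always taken once $b \ge \tau$, the whole budget is used by the end of the horizon whenever $B \le T$.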

Next, we study the regret of the UCB-PB algorithm. We show that the UCB-PB algorithm achieves logarithmic regret for any given $\rho \in (0,1)$. 

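Theorem 3. For a constrained contextual bandit with unit cost and two contexts, the UCB-PB algorithm achieves logarithmic regret as $T$ goes to infinity, i.e., 

$$\limsup_{T \to \infty} \frac{R_{\mathrm{UCB\text{-}PB}}(T, B)}{\log T} \le \sum_{k=1}^{K} \left[ \frac{27}{2\pi_2 \Delta^{(1)}_{2,k}} + 2\Delta^{(1)}_{2,k} \right] + \sum_{j=1}^{2} \sum_{k \ne k_j^*} \left[ \frac{2}{\Delta^{(j)}_{j,k}} + 2\Delta^{(j)}_{j,k} \right].$$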


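Proof. The proof of Theorem 3 is similar to that of Theorem 2, while the analysis of the error events is even simpler. Note that the regret is defined as the difference between the expected total rewards achieved by the oracle algorithm and the UCB-PB algorithm. For the oracle algorithm, let $C_j^*(t) = \sum_{t'=1}^{t} 1\{X_{t'} = j,\ A_{t'} = k_j^*\}$ be the number of times that the context-action pair $(j, k_j^*)$ has been executed up to round $t$. For the UCB-PB algorithm, recall that $C_{j,k}(t) = \sum_{t'=1}^{t} 1\{X_{t'} = j,\ A_{t'} = k\}$ is the number of times that the context-action pair $(j, k)$ has been executed up to round $t$, and let $C_j(t) = \sum_{k=1}^{K} C_{j,k}(t)$. Then the regret of UCB-PB can be expressed as 

$$\begin{aligned}
R_{\mathrm{UCB\text{-}PB}}(T, B)
&= \sum_{j=1}^{2} u_j^*\, \mathbb{E}[C_j^*(T)] - \sum_{j=1}^{2} \sum_{k=1}^{K} u_{j,k}\, \mathbb{E}[C_{j,k}(T)] \\
&= \sum_{j=1}^{2} \sum_{k=1}^{K} \Delta^{(j)}_{j,k}\, \mathbb{E}[C_{j,k}(T)] + \sum_{j=1}^{2} u_j^*\, \mathbb{E}\Big[C_j^*(T) - \sum_{k=1}^{K} C_{j,k}(T)\Big] \\
&= R^{(a)}_{\mathrm{UCB\text{-}PB}}(T, B) + R^{(c)}_{\mathrm{UCB\text{-}PB}}(T, B),
\end{aligned} \tag{37}$$

where $R^{(a)}_{\mathrm{UCB\text{-}PB}}(T, B)$ is the regret due to action-ranking errors, i.e., 

$$R^{(a)}_{\mathrm{UCB\text{-}PB}}(T, B) = \sum_{j=1}^{2} \sum_{k \ne k_j^*} \Delta^{(j)}_{j,k}\, \mathbb{E}[C_{j,k}(T)],$$

and $R^{(c)}_{\mathrm{UCB\text{-}PB}}(T, B)$ is the regret due to context-ranking errors, i.e., 

$$R^{(c)}_{\mathrm{UCB\text{-}PB}}(T, B) = \sum_{j=1}^{2} u_j^*\, \mathbb{E}\Big[C_j^*(T) - \sum_{k=1}^{K} C_{j,k}(T)\Big] = (u_1^* - u_2^*)\, \mathbb{E}[C_1^*(T) - C_1(T)]. \tag{38}$$

The expression of $R^{(c)}_{\mathrm{UCB\text{-}PB}}(T, B)$ holds because both the oracle algorithm and UCB-PB will exhaust their entire budget, i.e., $\sum_{j=1}^{2} C_j^*(T) = \sum_{j=1}^{2} C_j(T) = B$. 

For $R^{(a)}_{\mathrm{UCB\text{-}PB}}(T, B)$, Lemma 4 also holds under UCB-PB, i.e., 

$$R^{(a)}_{\mathrm{UCB\text{-}PB}}(T, B) \le \sum_{j=1}^{2} \sum_{k \ne k_j^*} \left[ \left( \frac{2}{\Delta^{(j)}_{j,k}} + 2\Delta^{(j)}_{j,k} \right) \log T + 2\Delta^{(j)}_{j,k} \right]. \tag{39}$$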


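Next, we show that $R^{(c)}_{\mathrm{UCB\text{-}PB}}(T, B)$ is of order $O(\log T)$. Let $(\tilde{X}_t, \tilde{A}_t)$ be the context-action pair that has the highest UCB in round $t$. Moreover, let $\tilde{C}_j(t)$ be the number of events that context $j$ has the maximum index up to round $t$, i.e., $\tilde{C}_j(t) = \sum_{t'=1}^{t} 1(\tilde{X}_{t'} = j)$, and let $\tilde{C}_{j,k}(t)$ be the number of events that the context-action pair $(j, k)$ has the highest UCB up to round $t$, i.e., $\tilde{C}_{j,k}(t) = \sum_{t'=1}^{t} 1(\tilde{X}_{t'} = j,\ \tilde{A}_{t'} = k)$. We show that the UCB-PB algorithm mistakes the suboptimal context for the optimal context at most $O(\log T)$ times, i.e., $\mathbb{E}[\tilde{C}_2(T)] = O(\log T)$, and then $\mathbb{E}[C_1^*(T) - C_1(T)] \le \mathbb{E}[\tilde{C}_2(T)] = O(\log T)$. 

Specifically, consider the suboptimal context $j = 2$. For $1 \le k \le K$, we have 

$$\mathbb{E}[\tilde{C}_{2,k}(T)] \le \ell^{(1)}_{2,k} + \sum_{t=1}^{T} \mathbb{P}\{\tilde{X}_t = 2,\ \tilde{A}_t = k,\ b_\tau > 0,\ \tilde{C}_{2,k}(t-1) \ge \ell^{(1)}_{2,k}\},$$

where 

$$\ell^{(1)}_{2,k} = \frac{2\log T}{\pi_2 (1-\epsilon)\epsilon^2 \big(\Delta^{(1)}_{2,k}\big)^2},$$

and $\epsilon \in (0,1)$. 

Based on Lemma 5, we have 

$$\mathbb{P}\{C_{2,k}(t-1) < \pi_2(1-\epsilon)\,\ell^{(1)}_{2,k},\ b_\tau > 0,\ \tilde{C}_{2,k}(t-1) \ge \ell^{(1)}_{2,k}\} \le T^{-4}. \tag{40}$$

Thus, 

$$\begin{aligned}
&\mathbb{P}\{\tilde{X}_t = 2,\ \tilde{A}_t = k,\ \tilde{C}_{2,k}(t-1) \ge \ell^{(1)}_{2,k},\ b_\tau > 0\} \\
&\quad \le \mathbb{P}\{\tilde{X}_t = 2,\ \tilde{A}_t = k,\ C_{2,k}(t-1) \ge \pi_2(1-\epsilon)\,\ell^{(1)}_{2,k}\} \\
&\qquad + \mathbb{P}\{C_{2,k}(t-1) < \pi_2(1-\epsilon)\,\ell^{(1)}_{2,k},\ \tilde{C}_{2,k}(t-1) \ge \ell^{(1)}_{2,k},\ b_\tau > 0\} \\
&\quad \le \mathbb{P}\{\hat{u}_{2,k}(t) \ge \hat{u}_{1,k_1^*}(t) \mid C_{2,k}(t-1) \ge \pi_2(1-\epsilon)\,\ell^{(1)}_{2,k}\} \\
&\qquad + \mathbb{P}\{C_{2,k}(t-1) < \pi_2(1-\epsilon)\,\ell^{(1)}_{2,k},\ \tilde{C}_{2,k}(t-1) \ge \ell^{(1)}_{2,k},\ b_\tau > 0\} \\
&\quad \le 2t^{-1} + T^{-4},
\end{aligned}$$

where the last inequality results from Lemma 3 (note that for $\tilde{X}_t = 2$ and $\tilde{A}_t = k$, $\hat{u}_{2,k}(t) \ge \hat{u}_1^*(t) \ge \hat{u}_{1,k_1^*}(t)$) and Eq. (40). 

Summing over all actions, we have 

$$\mathbb{E}[\tilde{C}_2(T)] = \sum_{k=1}^{K} \mathbb{E}[\tilde{C}_{2,k}(T)] \le \sum_{k=1}^{K} \ell^{(1)}_{2,k} + \sum_{k=1}^{K} \sum_{t=1}^{T} (2t^{-1} + T^{-4}) \le \sum_{k=1}^{K} \left[ \frac{2}{\pi_2(1-\epsilon)\epsilon^2 \big(\Delta^{(1)}_{2,k}\big)^2} + 2 \right] \log T + O(1).$$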
Consequently, 

$$\mathbb{E}[C_1^*(T) - C_1(T)] \le \sum_{k=1}^{K} \left[ \frac{2}{\pi_2(1-\epsilon)\epsilon^2 \big(\Delta^{(1)}_{2,k}\big)^2} + 2 \right] \log T + O(1) = \sum_{k=1}^{K} \left[ \frac{27}{2\pi_2 \big(\Delta^{(1)}_{2,k}\big)^2} + 2 \right] \log T + O(1). \tag{41}$$

The last equality is obtained by letting $\epsilon = 2/3$. Combining Eqs. (38), (39), and (41), and using the fact that $u_1^* - u_2^* \le u_1^* - u_{2,k}$ for all $k$, we obtain the conclusion of Theorem 3. □ 

E Constrained Contextual Bandits with Unknown Context Distribution 

In this section, we relax the assumption of known context distribution and study unit-cost systems with unknown context distribution. Since the arrival of contexts is independent of the actions taken by the agent, a natural idea is to implement the ALP or UCB-ALP algorithm based on the empirical distribution as follows: 

EALP and UCB-EALP Algorithms: the agent maintains the empirical distribution of the contexts, denoted by $\hat{\pi}_t = (\hat{\pi}_{1,t}, \hat{\pi}_{2,t}, \ldots, \hat{\pi}_{J,t})$, where $\hat{\pi}_{j,t} = \frac{1}{t} \sum_{t'=1}^{t} 1(X_{t'} = j)$. In each round, the agent executes the ALP (when the expected rewards are known) or UCB-ALP (when the expected rewards are unknown) algorithm with the context distribution $\pi$ in $\mathcal{LP}_{\tau,b}$ replaced by $\hat{\pi}_t$. These algorithms are referred to as Empirical ALP (EALP) and UCB-EALP, respectively. 

As we can see from the numerical simulations in Appendix G, the above EALP and UCB-EALP algorithms perform similarly to ALP and UCB-ALP, respectively. However, the regret analysis for these algorithms is challenging: the empirical distribution depends on the context arrivals in all past rounds, which introduces complex temporal dependency and makes it difficult to analyze the evolution of the remaining budget. Thus, we focus on the non-boundary cases and consider truncated versions of EALP and UCB-EALP. Specifically, we study algorithms that stop updating the empirical distribution after the $T_1$-th round ($T_1$ will be defined later) and use the fixed estimate $\hat{\pi}_{T_1}$ for the remaining rounds, which are referred to as EALP2 (as shown in Algorithm 3) and UCB-EALP2, respectively. We focus on the EALP2 algorithm for the case where the expected rewards are known, while the properties of UCB-EALP2 can be obtained by techniques similar to those in the analysis of UCB-ALP, combined with the properties of EALP2. 

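Algorithm 3 EALP2 

Input: Time horizon $T$, budget $B$, learning stage length $T_1$, and expected rewards $u_j^*$'s; 
Init: $\tau = T$; $b = B$; $\hat{\pi}_{j,0} = 0$, $\forall j$; 
for $t = 1$ to $T$ do 
    if $t \le T_1$ then 
        $\hat{\pi}_{j,t} \leftarrow \frac{1}{t} \sum_{t'=1}^{t} 1(X_{t'} = j)$, $\forall j$; 
    end if 
    if $b > 0$ then 
        Obtain the probabilities $p_j(b/\tau)$'s by solving $\mathcal{LP}_{\tau,b}$ with $\pi$ replaced by $\hat{\pi}_t$. 
        Take action $k^*_{X_t}$ with probability $p_{X_t}(b/\tau)$. 
    end if 
end for 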

Now we show that for a sufficiently large $T$ and an appropriately chosen $T_1$, the EALP2 algorithm achieves performance similar to that of ALP in the non-boundary cases. Let $\delta = \min\{q_{\tilde{j}(\rho)+1} - \rho,\ \rho - q_{\tilde{j}(\rho)}\}$ be the gap between $\rho$ and the boundaries. The following theorem shows that EALP2 achieves $O(1)$ regret with appropriately chosen $T$ and $T_1$. 

Theorem 4. Given a fixed $\rho \in (0,1)$, $\rho \ne q_j$, $j = 1, 2, \ldots, J - 1$. If $T_1 = 16J^2 \log^3 T/\delta^2$ and $T$ satisfies $\log^3 T/T \le \delta^3/(64J^2)$, then the regret of EALP2 satisfies $R_{\mathrm{EALP2}}(T, B) = O(1)$. 

Note that here we assume $\delta$ is known for simplicity of presentation. When considering practical scenarios where $\delta$ is unknown, we can obtain a lower confidence bound of $\delta$ as follows. At round $t$, let $\hat{q}_{j,t} = \sum_{j'=1}^{j} \hat{\pi}_{j',t}$ be the empirical estimate of the cumulative probability. Further, let $\hat{j}_t(\rho) = \max\{j : \hat{q}_{j,t} \le \rho\}$ be the threshold under the empirical estimate. Let $\hat{\delta}_t = \min\{\hat{q}_{\hat{j}_t(\rho)+1,t} - \rho,\ \rho - \hat{q}_{\hat{j}_t(\rho),t}\}$ 

and St = ^St- Then St is a lower confidence bound of S with P{(5 > ^t} > 1 — 

We choose Ti which is the smallest t such that < 7 ^ and t > 16 \og^ T/Sf. Then the 

following analysis holds, while the regret due to the event that S > St will be 0(1). Moreover, such 
a t will appear with high probability after 64 log^ T/S'^ rounds for the non-boundary cases, and 
not appear with high probability for the boundary cases. 

Similar to the non-boundary cases in Theorem 1, the key idea of proving Theorem 4 is to show that under EALP2, the average remaining budget $b_\tau/\tau$ will not cross the boundaries with high probability. To achieve this, we examine the expectation of $b_\tau/\tau$ and its concentration properties under EALP2. 

Step 1: Estimation error of $\hat{\pi}_{T_1}$. Let $\alpha = \delta/(4J \log T)$. According to the Hoeffding-Chernoff bound, we have 

$$\mathbb{P}\{|\hat{\pi}_{j,T_1} - \pi_j| \le \alpha,\ \forall j\} \ge 1 - 2J e^{-2\alpha^2 T_1} = 1 - \frac{2J}{T^2}. \tag{42}$$

Step 2: Bound on the expectation of $b_\tau/\tau$. 

Lemma 7. Assume $|\hat{\pi}_{j,T_1} - \pi_j| \le \alpha$ for all $j$. Then, for all $T_1 \le t \le T$, the expectation of the average remaining budget satisfies 

$$|\mathbb{E}[b_\tau/\tau] - \rho| \le \frac{\delta}{2}, \tag{43}$$

where $\tau = T - t + 1$. 


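Proof. First, we note that the average remaining budget $b_\tau/\tau$ is close to the initial value $\rho$ at round $T_1$, because we can verify that for all $t \le T_1$, 

$$\rho - \frac{\delta}{4} \le \frac{B - T_1}{T - T_1 + 1} \le \frac{b_\tau}{\tau} \le \frac{B}{T - T_1 + 1} \le \rho + \frac{\delta}{4}. \tag{44}$$

Now we show by mathematical induction that, if $|\hat{\pi}_{j,T_1} - \pi_j| \le \alpha$ for all $j$, then $|\mathbb{E}[b_\tau/\tau] - \rho| \le J\alpha \sum_{\tau'=\tau}^{T-T_1+1} \frac{1}{\tau'} + \frac{\delta}{4}$ for $\tau \le T - T_1 + 1$ (i.e., $t \ge T_1$). 

Specifically, for $t = T_1$, we have $|\mathbb{E}[b_\tau/\tau] - \rho| \le \frac{\delta}{4}$ according to Eq. (44). For any given $t \ge T_1$, we have $\tau = T - t + 1$, and 

$$\begin{aligned}
\mathbb{E}[b_{\tau-1} \mid b_\tau = b]
&= b - \left[ \sum_{j \le \tilde{j}(b/\tau)} \pi_j + \pi_{\tilde{j}(b/\tau)+1}\, \frac{b/\tau - \sum_{j \le \tilde{j}(b/\tau)} \hat{\pi}_{j,T_1}}{\hat{\pi}_{\tilde{j}(b/\tau)+1,T_1}} \right] \\
&= b - b/\tau + \frac{b/\tau - \sum_{j \le \tilde{j}(b/\tau)} \hat{\pi}_{j,T_1}}{\hat{\pi}_{\tilde{j}(b/\tau)+1,T_1}} \big( \hat{\pi}_{\tilde{j}(b/\tau)+1,T_1} - \pi_{\tilde{j}(b/\tau)+1} \big) + \sum_{j \le \tilde{j}(b/\tau)} (\hat{\pi}_{j,T_1} - \pi_j),
\end{aligned}$$

where the threshold $\tilde{j}(b/\tau)$ is computed under the estimate $\hat{\pi}_{T_1}$. 

Note that $0 \le \dfrac{b/\tau - \sum_{j \le \tilde{j}(b/\tau)} \hat{\pi}_{j,T_1}}{\hat{\pi}_{\tilde{j}(b/\tau)+1,T_1}} \le 1$. Thus, $|\mathbb{E}[b_{\tau-1} \mid b_\tau = b] - (b - b/\tau)| \le J\alpha$, implying that $\big|\mathbb{E}[\frac{b_{\tau-1}}{\tau-1} \mid b_\tau = b] - \frac{b}{\tau}\big| \le \frac{J\alpha}{\tau - 1}$. If $|\mathbb{E}[b_\tau/\tau] - \rho| \le J\alpha \sum_{\tau'=\tau}^{T-T_1+1} \frac{1}{\tau'} + \frac{\delta}{4}$ for $2 \le \tau \le T - T_1 + 1$, then $|\mathbb{E}[b_{\tau-1}/(\tau - 1)] - \rho| \le J\alpha \sum_{\tau'=\tau}^{T-T_1+1} \frac{1}{\tau'} + \frac{J\alpha}{\tau - 1} + \frac{\delta}{4}$. 

The conclusion of Lemma 7 then follows because $|\mathbb{E}[b_\tau/\tau] - \rho| \le J\alpha \sum_{\tau'=\tau}^{T-T_1+1} \frac{1}{\tau'} + \frac{\delta}{4} \le J\alpha \log T + \frac{\delta}{4} \le \frac{\delta}{2}$. □ 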


Step 3: Concentration of $b_\tau/\tau$. The next lemma shows the concentration of the average remaining budget $b_\tau/\tau$. 

Lemma 8. Assume $|\hat{\pi}_{j,T_1} - \pi_j| \le \alpha$ for all $j$. The average remaining budget $b_\tau/\tau$ in round $t = T - \tau + 1$ satisfies 

P{|^-E[^]|>V4}<2exp(-^). (45) 
To show the concentration of $b_\tau/\tau$, we first use a coupling argument to show the following lemma and then use the method of averaged bounded differences [11]. 

Lemma 9. Assume ~ '^j\ ^ oifor all j. The remaining budget in round t = T — t -\- 1 

satisfies 

where Zt'-i = {Zi, Z 2 ,..., and cr = 1 — min^ 


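Proof. We bound the difference by constructing a coupling $\mathcal{M}$ of the two conditional distributions $(\cdot \mid Z_{t'-1}, Z_{t'} = 1)$ and $(\cdot \mid Z_{t'-1}, Z_{t'} = 0)$. Let $\zeta_{t'+1}, \zeta_{t'+2}, \ldots, \zeta_{T-\tau}$ and $\zeta'_{t'+1}, \zeta'_{t'+2}, \ldots, \zeta'_{T-\tau}$ be the pair of random variables in the coupling $\mathcal{M}$. We construct the coupling as follows:

Coupling: We generate the values of the $\zeta_{t''}$'s and $\zeta'_{t''}$'s sequentially. For each $t'' > t'$, let

$$b_{T-t''+1} = B - \sum_{s=1}^{t'-1} Z_s - 1 - \sum_{s=t'+1}^{t''-1} \zeta_s \quad \text{and} \quad b'_{T-t''+1} = B - \sum_{s=1}^{t'-1} Z_s - \sum_{s=t'+1}^{t''-1} \zeta'_s$$

be the remaining budgets in round $t''$ corresponding to the pair of random variables. For $\zeta_{t''}$, we pick its value randomly with distribution $\mathbb{P}\{\zeta_{t''} = 1\} = h\left(\frac{b_{T-t''+1}}{T-t''+1}\right)$ and $\mathbb{P}\{\zeta_{t''} = 0\} = 1 - h\left(\frac{b_{T-t''+1}}{T-t''+1}\right)$, where $h(\rho)$ is the probability that one unit of budget will be consumed under EALP2 when the average remaining budget is $\rho$, i.e.,

$$h(\rho) = \sum_{j \le \tilde{j}(\rho)} \pi_j + \frac{\pi_{\tilde{j}(\rho)+1}}{\hat{\pi}_{\tilde{j}(\rho)+1,T_1}} \Big( \rho - \sum_{j \le \tilde{j}(\rho)} \hat{\pi}_{j,T_1} \Big). \tag{47}$$

For $\zeta'_{t''}$, we generate its value conditioned on $\zeta_{t''}$. If $b'_{T-t''+1} = b_{T-t''+1}$, then $\zeta'_{t''} = \zeta_{t''}$, i.e.,

$$\mathbb{P}\{\zeta'_{t''} = \zeta_{t''} \mid b'_{T-t''+1} = b_{T-t''+1}\} = 1.$$

If $b'_{T-t''+1} = b_{T-t''+1} + 1$, then

$$\mathbb{P}\{\zeta'_{t''} = 1 \mid b'_{T-t''+1} = b_{T-t''+1} + 1, \zeta_{t''} = 1\} = 1,$$

$$\mathbb{P}\{\zeta'_{t''} = 1 \mid b'_{T-t''+1} = b_{T-t''+1} + 1, \zeta_{t''} = 0\} = \frac{h\left(\frac{b'_{T-t''+1}}{T-t''+1}\right) - h\left(\frac{b_{T-t''+1}}{T-t''+1}\right)}{1 - h\left(\frac{b_{T-t''+1}}{T-t''+1}\right)}.$$

Note that according to the above construction, $b'_{T-t''+1} - b_{T-t''+1}$ can only be 0 or 1. We can verify that the marginals satisfy

$$(\zeta_{t''}, t'' > t') \sim (Z_{t''}, t'' > t' \mid Z_{t'-1}, Z_{t'} = 1),$$

and

$$(\zeta'_{t''}, t'' > t') \sim (Z_{t''}, t'' > t' \mid Z_{t'-1}, Z_{t'} = 0).$$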


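From the construction of the coupling, we know that $b'_\tau - b_\tau = 1$ if and only if $\zeta'_{t''} = \zeta_{t''}$ for all $t' < t'' \le T - \tau$. Thus,

$$\left| \mathbb{E}[b_\tau \mid Z_{t'-1}, Z_{t'} = 1] - \mathbb{E}[b_\tau \mid Z_{t'-1}, Z_{t'} = 0] \right| = \mathbb{P}\{\zeta'_{t''} = \zeta_{t''},\ t' < t'' \le T - \tau\} = \prod_{t''=t'+1}^{T-\tau} \mathbb{P}\{\zeta'_{t''} = \zeta_{t''} \mid \zeta'_s = \zeta_s,\ t' < s \le t''-1\}. \tag{48}$$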

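We show that each term in Eq. (48) can be bounded as follows.

Lemma 10. The coupling $\mathcal{M}$ satisfies

$$\mathbb{P}\{\zeta'_{t''} = \zeta_{t''} \mid \zeta'_s = \zeta_s,\ t' < s \le t''-1\} \le 1 - \frac{1-\sigma}{T-t''+1},$$

where $\sigma = 1 - \min_j \frac{\pi_j}{\pi_j + \alpha}$.

Proof. Conditioned on $\zeta'_s = \zeta_s$ for $t' < s \le t''-1$, we have $b'_{T-t''+1} = b_{T-t''+1} + 1$, and $\zeta'_{t''} \ne \zeta_{t''}$ if and only if $\zeta_{t''} = 0$ and $\zeta'_{t''} = 1$. Thus,

$$\mathbb{P}\{\zeta'_{t''} = \zeta_{t''} \mid \zeta'_s = \zeta_s,\ t' < s \le t''-1\} = 1 - \left[ 1 - h\left(\tfrac{b_{T-t''+1}}{T-t''+1}\right) \right] \cdot \frac{h\left(\tfrac{b'_{T-t''+1}}{T-t''+1}\right) - h\left(\tfrac{b_{T-t''+1}}{T-t''+1}\right)}{1 - h\left(\tfrac{b_{T-t''+1}}{T-t''+1}\right)} = 1 - \left[ h\left(\tfrac{b'_{T-t''+1}}{T-t''+1}\right) - h\left(\tfrac{b_{T-t''+1}}{T-t''+1}\right) \right]. \tag{49}$$

To prove Lemma 10, it suffices to show that for any $b$ and $\tau$ satisfying $b \le \tau - 1$, we have $h\left(\frac{b+1}{\tau}\right) - h\left(\frac{b}{\tau}\right) \ge \gamma/\tau$, where $\gamma = \min_j \frac{\pi_j}{\pi_j + \alpha}$.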

Specifically, from the definition of h{-) in Eq. ( |47] i, we know that if j(^^) = ^( 7 ), we have 

If j{^) > i(y), we have ^ ^i.Ti < ^)(A)+i,Ti and ^j.Ti > 0. 

Therefore, 


K^)-h(^) 


= £ + 

i=i(7)+i 

)(^) 

= £ + 

f = i (|)+2 

j('=±i) 


&+1 _ 
r ^ 








7 


“ TT ;,-6 


J(7)+1 




f=i(|)+2 




i(*^) 




&+1 


^ + 7 - - 51 ^ 7/r. 


i = i (")+2 


- E 

3<3C-f^) 

(50) 

□ 


^i,Ti 


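Using Lemma 10, we have

$$\left| \mathbb{E}[b_\tau \mid Z_{t'-1}, Z_{t'} = 1] - \mathbb{E}[b_\tau \mid Z_{t'-1}, Z_{t'} = 0] \right| \le \prod_{t''=t'+1}^{T-\tau} \left( 1 - \frac{1-\sigma}{T-t''+1} \right) \overset{(a)}{=} \frac{\tau+\sigma}{T-t'} \prod_{s=1}^{T-t'-\tau-1} \left( 1 + \frac{\sigma}{\tau+s} \right) \overset{(b)}{\le} \frac{\tau+\sigma}{T-t'} \left( \frac{T-t'-1}{\tau} \right)^{\sigma} \le 2 \left( \frac{\tau}{T-t'} \right)^{1-\sigma}.$$

Equality (a) is obtained by merging the numerator of each term with the denominator of the next term. Inequality (b) is true because $\sigma < 1$, and

$$\log \prod_{s=1}^{T-t'-\tau-1} \left( 1 + \frac{\sigma}{\tau+s} \right) = \sum_{s=1}^{T-t'-\tau-1} \log\left( 1 + \frac{\sigma}{\tau+s} \right) \le \sum_{s=1}^{T-t'-\tau-1} \frac{\sigma}{\tau+s} \le \sigma \log\left( \frac{T-t'-1}{\tau} \right).$$

□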
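The chain of bounds above can be sanity-checked numerically. The following is a minimal sketch, not part of the original proof: the function name `product_bound_terms` is hypothetical, and the three returned values are assumed to correspond to the left-hand product, the merged form (a), and the final bound $2(\tau/(T-t'))^{1-\sigma}$.

```python
def product_bound_terms(T, t_prime, tau, sigma):
    """Return (product, merged form (a), final bound) for the coupling bound.

    Assumes t_prime < T - tau and 0 <= sigma < 1.
    """
    # Left side: product over t'' = t'+1, ..., T-tau of 1 - (1-sigma)/(T-t''+1).
    prod = 1.0
    for t2 in range(t_prime + 1, T - tau + 1):
        prod *= 1.0 - (1.0 - sigma) / (T - t2 + 1)

    # Form (a): (tau+sigma)/(T-t') * prod_{s=1}^{T-t'-tau-1} (1 + sigma/(tau+s)).
    merged = (tau + sigma) / (T - t_prime)
    for s in range(1, T - t_prime - tau):
        merged *= 1.0 + sigma / (tau + s)

    # Final bound: 2 * (tau/(T-t'))^(1-sigma).
    final = 2.0 * (tau / (T - t_prime)) ** (1.0 - sigma)
    return prod, merged, final


if __name__ == "__main__":
    for args in [(1000, 100, 50, 0.3), (200, 10, 5, 0.9), (50, 48, 1, 0.0)]:
        prod, merged, final = product_bound_terms(*args)
        print(args, prod, merged, final)
```

In each case the first two values should agree up to floating-point error, and both should stay below the third.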


To use the method of averaged bounded differences [23], we note that


T-t 


E 

t'=Ti ‘- 


2 ( 


T , l-(7 


T-f" 


< 4r 


2-2q 


_ 

L-l-2cr 


1 


(T - Ti + 1) 


T3^] ^ 4r. 


(51) 
Then, according to Corollary 5.1 in [23] and Lemma 9, we have

P{| 6 t- -E[&^]| >( 5 t/ 4 } < 2exp(- =2exp(- ^). 

implying Eq. ( |45] l in Lemma|^ 

Step 4: Upper bound of $R_{\mathrm{EALP2}}(T, B)$. Now we bound the regret $R_{\mathrm{EALP2}}(T, B)$ using the results obtained in the previous steps. We analyze the event of “boundary-crossing” in round $t$, denoted as $\mathcal{E}_{\mathrm{cross},t}$, which is the event that $\tilde{j}(b_\tau/\tau) \ne \tilde{j}(\rho)$. Boundary-crossing may happen when the empirical estimate of the context distribution is inaccurate or when the average remaining budget $b_\tau/\tau$ deviates far from $\rho$. We study the probability of $\mathcal{E}_{\mathrm{cross},t}$ for $t \le T_1$ and $t > T_1$, respectively.

For $t \le T_1$, the average remaining budget satisfies $\rho - \delta/4 \le b_\tau/\tau \le \rho + \delta/4$, as discussed in Step 2. The event $\mathcal{E}_{\mathrm{cross},t}$ may occur only when there is some $j$ such that $|\hat{\pi}_{j,t} - \pi_j| > \delta/(4J)$. Thus,

$$\mathbb{P}\{\mathcal{E}_{\mathrm{cross},t}\} \le \mathbb{P}\{\exists j,\ |\hat{\pi}_{j,t} - \pi_j| > \delta/(4J)\} \le 2J \exp\left( -\frac{\delta^2 t}{8J^2} \right), \quad t \le T_1. \tag{52}$$
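The exponent in Eq. (52) is what Hoeffding's inequality combined with a union bound would give. A short worked version, assuming $\hat{\pi}_{j,t}$ is the empirical frequency of context $j$ over $t$ i.i.d. arrivals and writing $\epsilon$ for the deviation level (a shorthand introduced here), reads:

$$\begin{aligned}
\mathbb{P}\{|\hat{\pi}_{j,t} - \pi_j| > \epsilon\} &\le 2\exp(-2\epsilon^2 t), \\
\epsilon = \frac{\delta}{4J} \quad &\Longrightarrow \quad 2\epsilon^2 t = \frac{\delta^2 t}{8J^2}, \\
\mathbb{P}\Big\{\exists j:\ |\hat{\pi}_{j,t} - \pi_j| > \frac{\delta}{4J}\Big\} &\le \sum_{j=1}^{J} 2\exp\left(-\frac{\delta^2 t}{8J^2}\right) = 2J\exp\left(-\frac{\delta^2 t}{8J^2}\right).
\end{aligned}$$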

For t > Ti, if the empirical distribution < oi {< 5/(4J) for sufficiently large T) for 

all j, then the average remaining budget satisfies P{ | ^ — E[^] | > 35/4} < 2 exp ( — due to 
Lemma 1^ and Lemma 1^ Thus, 

P{fcross,t} < P{3j, Ittj-.Ti - ttjI > a} -f P{|^ - E[^]| > 35/4|Vj, \Trj^Ti - T^j\ > «} 

< ^+2exp(-^), f>ri. (53) 


Now we bound the expectation of Cj (T), i.e., the number of executions under context j. 
For j < j{p). 


r T 


E[Q(T)] = E 


> 


Y^t{Xt=j,At=k*) 

At = fc*p£^cross,t}IP{^^cross,i} 
T 

^7rj (l - P{^cross,t}) 




Ti 


^ 7r^T-^2Jexp(-5^f/8J^)- ^ -f 2 exp ( - p-)] = tt^-T- f 0(1), 


t=i 


t=Ti + l 


rp2 
and


E[0,(T)] < tt.T, 


Similarly, for j > j{p) + 1, we have 

T 

E[0,(T)] < Y,nXt=J,At = fc}|5cross,t}P{fcross.t} = 0(1). 
t=l 

For j = j (p) -f 1, we have 


(54) 


E[C,{T)]=E[B-bo]- nCM>B-T Y. ^^._0(l) = (p- 7t,)T-0{1). 

j¥=3{p) + '^ 3<3{P) 3<3{p) 

We complete the proof of Theorem |^by summing over all contexts: 

$$U_{\mathrm{EALP2}}(T, B) = \sum_{j=1}^{J} u_j^* \, \mathbb{E}[C_j(T)] \ge T v(\rho) - O(1) = U(T, B) - O(1).$$
F Constrained Contextual Bandits with Heterogeneous Costs

In this section, we consider the case where the cost for each action $k$ under context $j$ is fixed at $c_{j,k}$, which may differ across $j$ and $k$. We discuss how to use the insight from unit-cost systems in heterogeneous-cost systems.


F.1 Approximation of the Oracle Algorithm

Similar to unit-cost systems, we first study the case with known statistics. We generalize the upper bound and the ALP algorithm developed for unit-cost systems to general-cost systems.


F.1.1 Upper Bound 


With known statistics, the agent knows the context distribution $\pi_j$'s, the costs $c_{j,k}$'s, and the expected rewards $u_{j,k}$'s. In heterogeneous-cost systems, the quality of a context-action pair $(j, k)$ is roughly captured by the normalized reward, denoted by $\eta_{j,k} = u_{j,k}/c_{j,k}$. However, unlike the unit-cost case, the agent cannot focus only on the “best” action with the highest normalized reward, i.e., $k_j^* = \arg\max_k \eta_{j,k}$, when making a decision under context $j$. This is because there may exist another action $k'$ such that $\eta_{j,k'} < \eta_{j,k_j^*}$ but $u_{j,k'} > u_{j,k_j^*}$ (and of course, $c_{j,k'} > c_{j,k_j^*}$). If there is sufficient budget allocated for context $j$, then the agent may take action $k'$ to maximize the expected reward. Therefore, the agent needs to consider all actions under each context. Let $p_{j,k}$ be the probability that action $k$ is taken under context $j$. We define the following LP problem:




$$(\mathcal{LP}'_{T,B}) \quad \text{maximize} \quad \sum_{j=1}^{J} \pi_j \sum_{k=1}^{K} p_{j,k} u_{j,k}, \tag{55}$$

$$\text{subject to} \quad \sum_{j=1}^{J} \pi_j \sum_{k=1}^{K} p_{j,k} c_{j,k} \le B/T, \tag{56}$$

$$\sum_{k=1}^{K} p_{j,k} \le 1, \ \forall j, \tag{57}$$

$$p_{j,k} \in [0, 1].$$

The above LP problem $\mathcal{LP}'_{T,B}$ can be solved efficiently by optimization tools. Let $v(\rho)$ be the maximum value of $\mathcal{LP}'_{T,B}$. As in the unit-cost case, we can show that $T v(\rho)$ is an upper bound of the expected total reward, i.e., $T v(\rho) \ge U^*(T, B)$.

To obtain insight from the solution of $\mathcal{LP}'_{T,B}$, we derive an explicit representation for the solution by analyzing the structure of $\mathcal{LP}'_{T,B}$. Note that there are two types of (non-trivial) constraints in $\mathcal{LP}'_{T,B}$: one is the “inter-context” budget constraint (56), the other is the “intra-context” constraint (57). These constraints can be decoupled by first allocating budget for each context, and then solving a subproblem with the allocated budget constraint for each context. Specifically, let $\rho_j$ be the budget allocated to context $j$; then $\mathcal{LP}'_{T,B}$ can be decomposed as follows:


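$$\text{maximize} \quad \sum_{j=1}^{J} \pi_j v_j(\rho_j), \tag{58}$$

$$\text{subject to} \quad \sum_{j=1}^{J} \pi_j \rho_j \le B/T, \tag{59}$$

where

$$(\mathcal{SP}_j) \quad v_j(\rho_j) = \text{maximize} \quad \sum_{k=1}^{K} p_{j,k} u_{j,k}, \tag{60}$$

$$\text{subject to} \quad \sum_{k=1}^{K} p_{j,k} c_{j,k} \le \rho_j,$$

$$\sum_{k=1}^{K} p_{j,k} \le 1,$$

$$p_{j,k} \in [0, 1].$$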


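Next, by analyzing the sub-problem $\mathcal{SP}_j$, we show that some actions can be deleted without affecting the performance, i.e., their probability is 0 in the optimal solution.

Lemma 11. For any given $\rho_j \ge 0$, there exists an optimal solution of $\mathcal{SP}_j$, i.e., $p_j^* = (p^*_{j,1}, p^*_{j,2}, \ldots, p^*_{j,K})$, that satisfies:

(1) For $k_1$, if there exists another action $k_2$ such that $\eta_{j,k_1} \le \eta_{j,k_2}$ and $u_{j,k_1} \le u_{j,k_2}$, then $p^*_{j,k_1} = 0$;

(2) For $k_1$, if there exist two actions $k_2$ and $k_3$ such that $\eta_{j,k_2} \ge \eta_{j,k_1} \ge \eta_{j,k_3}$, $u_{j,k_2} \le u_{j,k_1} \le u_{j,k_3}$, and $\frac{u_{j,k_1} - u_{j,k_2}}{c_{j,k_1} - c_{j,k_2}} \le \frac{u_{j,k_3} - u_{j,k_2}}{c_{j,k_3} - c_{j,k_2}}$, then $p^*_{j,k_1} = 0$.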


Intuitively, the first part of Lemma 11 shows that if an action has both a small normalized expected reward and a small original expected reward, then it can be removed. The second part of Lemma 11 shows that if an action has a small normalized expected reward and a medium original expected reward, but its increasing rate is smaller than that of another action with larger expected reward, then it can also be removed.


Proof. The key idea of this proof is that, if the conditions are satisfied and there is a feasible solution $p_j = (p_{j,1}, p_{j,2}, \ldots, p_{j,K})$ such that $p_{j,k_1} > 0$, then we can construct another feasible solution $p'_j$ such that $p'_{j,k_1} = 0$, without reducing the objective value $v_j(\rho_j)$.

We first prove part (1). Under the conditions of part (1), if $p_j$ is a feasible solution of $\mathcal{SP}_j$ with $p_{j,k_1} > 0$, then consider another solution $p'_j$, where $p'_{j,k} = p_{j,k}$ for $k \notin \{k_1, k_2\}$, $p'_{j,k_1} = 0$, and $p'_{j,k_2} = p_{j,k_2} + p_{j,k_1} \min\left\{ \frac{c_{j,k_1}}{c_{j,k_2}}, 1 \right\}$. Then, we can verify that $p'_j$ is a feasible solution of $(\mathcal{SP}_j)$, and the objective value under $p'_j$ is no less than that under $p_j$.
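The verification for part (1) can be spelled out in a few lines. The sketch below assumes the construction moves the probability mass $p_{j,k_1}\lambda$ from action $k_1$ to action $k_2$, with $\lambda = \min\{c_{j,k_1}/c_{j,k_2}, 1\}$; the symbol $\lambda$ is a shorthand introduced here and does not appear in the original.

$$\begin{aligned}
\text{Budget:} \quad & \lambda\, c_{j,k_2} \le c_{j,k_1} \ \Longrightarrow\ p_{j,k_1} \lambda\, c_{j,k_2} \le p_{j,k_1} c_{j,k_1}, \\
\text{Probability:} \quad & \lambda \le 1 \ \Longrightarrow\ \sum_{k} p'_{j,k} \le \sum_{k} p_{j,k} \le 1, \\
\text{Objective:} \quad & \lambda\, u_{j,k_2} =
\begin{cases}
c_{j,k_1} \eta_{j,k_2} \ge c_{j,k_1} \eta_{j,k_1} = u_{j,k_1}, & \text{if } c_{j,k_1} \le c_{j,k_2}, \\
u_{j,k_2} \ge u_{j,k_1}, & \text{if } c_{j,k_1} > c_{j,k_2}.
\end{cases}
\end{aligned}$$

In both cases the reward gained through $k_2$ is at least the reward given up on $k_1$, while neither constraint is tightened.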


For the second part, if the conditions are satisfied and $p_{j,k_1} > 0$, then we construct a new solution $p'_j$ by re-allocating the budget consumed by action $k_1$ to actions $k_2$ and $k_3$, without violating the constraints. Specifically, we keep the probabilities of the other actions the same as in the original solution, i.e., $p'_{j,k} = p_{j,k}$ for $k \notin \{k_1, k_2, k_3\}$, and set $p'_{j,k_1} = 0$ for action $k_1$. For $k_2$ and $k_3$, to maximize the objective function, we would like to allocate as much budget as possible to $k_3$ unless there is remaining
budget. Therefore, we set = PjM = PiM + E/c/fci P],k + 

PjM = PjM + - J -and p^. ^ = pj-fe3 -b- ^ -- 


if J2k^ki Pj,k 


Pj>ki ^j,ki 


^ 3,^2 Cj-.feg 


> 1. We can verify that p satisfies the constraints of {SVj) but the 


objective value is no less than that under p 


3' 


□ 


With Lemma 11, the agent can ignore the actions that will obviously be allocated zero probability under a given context $j$. We call the set of remaining actions the candidate set for context $j$, denoted as $\mathcal{A}_j$. We propose an algorithm to construct the candidate action set for context $j$, as shown in Algorithm 4.

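Algorithm 4 Find Candidate Set for Context j

Input: $c_{j,k}$'s, $u_{j,k}$'s, for all $1 \le k \le K$;
Output: $\mathcal{A}_j$;
Init: $\mathcal{A}_j = \{1, 2, \ldots, K\}$;
Calculate normalized rewards: $\eta_{j,k} = u_{j,k}/c_{j,k}$;
Sort actions in descending order of their normalized rewards:
    $\eta_{j,k_1} \ge \eta_{j,k_2} \ge \cdots \ge \eta_{j,k_K}$.
for a = 2 to K do
    if $\exists a' < a$ such that $u_{j,k_a} \le u_{j,k_{a'}}$ then
        $\mathcal{A}_j = \mathcal{A}_j \setminus \{k_a\}$;
    end if
end for
a = 1;
while a < K − 1 do
    Find the action with highest increasing rate:
        $a^* = \arg\max_{a' : k_{a'} \in \mathcal{A}_j,\ a' > a} \frac{u_{j,k_{a'}} - u_{j,k_a}}{c_{j,k_{a'}} - c_{j,k_a}}$;
    Remove the actions in between:
        $\mathcal{A}_j = \mathcal{A}_j \setminus \{k_{a'} : a < a' < a^*\}$.
    Move to the next candidate action: $a = a^*$;
end while

For context $j$, assume that the candidate set $\mathcal{A}_j = \{k_{j,1}, k_{j,2}, \ldots, k_{j,K_j}\}$ has been sorted in descending order of normalized rewards, i.e., $\eta_{j,k_{j,1}} \ge \eta_{j,k_{j,2}} \ge \cdots \ge \eta_{j,k_{j,K_j}}$. From Algorithm 4, we know that $u_{j,k_{j,1}} < u_{j,k_{j,2}} < \cdots < u_{j,k_{j,K_j}}$ and $c_{j,k_{j,1}} < c_{j,k_{j,2}} < \cdots < c_{j,k_{j,K_j}}$.

The agent now only needs to consider the actions in the candidate set $\mathcal{A}_j$. To decouple the “intra-context” constraint (57), we introduce the following transformation:

$$p_{j,k_{j,a}} = \begin{cases} \tilde{p}_{j,k_{j,a}} - \tilde{p}_{j,k_{j,a+1}}, & \text{if } 1 \le a \le K_j - 1, \\ \tilde{p}_{j,k_{j,K_j}}, & \text{if } a = K_j, \end{cases}$$

where $\tilde{p}_{j,k_{j,a}} \in [0, 1]$, and $\tilde{p}_{j,k_{j,a}} \ge \tilde{p}_{j,k_{j,a+1}}$ for $1 \le a \le K_j - 1$. Substituting the transformation into $(\mathcal{SP}_j)$ and reorganizing, we obtain

$$(\widetilde{\mathcal{SP}}_j) \quad \text{maximize} \quad \sum_{a=1}^{K_j} \tilde{p}_{j,k_{j,a}} \tilde{u}_{j,k_{j,a}},$$

$$\text{subject to} \quad \sum_{a=1}^{K_j} \tilde{p}_{j,k_{j,a}} \tilde{c}_{j,k_{j,a}} \le \rho_j,$$

$$\tilde{p}_{j,k_{j,a}} \ge \tilde{p}_{j,k_{j,a+1}}, \quad 1 \le a \le K_j - 1, \tag{61}$$

$$\tilde{p}_{j,k_{j,a}} \in [0, 1], \ \forall a,$$

where

$$\tilde{u}_{j,k_{j,a}} = \begin{cases} u_{j,k_{j,1}}, & \text{if } a = 1, \\ u_{j,k_{j,a}} - u_{j,k_{j,a-1}}, & \text{if } 2 \le a \le K_j, \end{cases} \qquad \tilde{c}_{j,k_{j,a}} = \begin{cases} c_{j,k_{j,1}}, & \text{if } a = 1, \\ c_{j,k_{j,a}} - c_{j,k_{j,a-1}}, & \text{if } 2 \le a \le K_j. \end{cases}$$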
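The candidate-set construction for a single context can be sketched in Python as follows. This is an illustrative reading of Algorithm 4 rather than the authors' implementation: the function name `find_candidate_set` is hypothetical, costs and rewards are assumed strictly positive, and ties in the increasing rate are broken in favor of the earliest action, which the text does not specify.

```python
def find_candidate_set(costs, rewards):
    """Sketch of Algorithm 4 for one context.

    costs[k], rewards[k]: cost and expected reward of action k (both > 0).
    Returns candidate action indices in descending order of reward/cost.
    """
    K = len(costs)
    # Sort actions by normalized reward u/c, highest first.
    order = sorted(range(K), key=lambda k: rewards[k] / costs[k], reverse=True)

    # Pass 1 (Lemma 11, part 1): drop an action if some earlier action in
    # the order already has at least the same expected reward. Comparing
    # against the kept actions is enough, since a dropped action is always
    # dominated by a kept one.
    kept = []
    for k in order:
        if all(rewards[k] > rewards[k2] for k2 in kept):
            kept.append(k)

    # Pass 2 (Lemma 11, part 2): from the current action, jump to the later
    # action with the highest increasing rate (u' - u) / (c' - c) and drop
    # every action skipped in between.
    result = [kept[0]]
    i = 0
    while i < len(kept) - 1:
        cur = kept[i]
        best = max(
            range(i + 1, len(kept)),
            key=lambda a: (rewards[kept[a]] - rewards[cur])
            / (costs[kept[a]] - costs[cur]),
        )
        result.append(kept[best])
        i = best
    return result


if __name__ == "__main__":
    # Action 3 is dominated in pass 1; action 1 is skipped in pass 2.
    print(find_candidate_set([1.0, 2.0, 4.0, 5.0], [0.5, 0.6, 1.0, 0.9]))
```

After pass 1 the kept actions have strictly increasing rewards and costs, so the denominators in pass 2 are nonzero under the positivity assumption.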


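Next, we show that the constraint (61) can indeed be removed. For each $k_{j,a}$, we can view $\tilde{c}_{j,k_{j,a}}$ and $\tilde{u}_{j,k_{j,a}}$ as the cost and expected reward of a virtual action. Let $\tilde{\eta}_{j,k_{j,a}} = \tilde{u}_{j,k_{j,a}} / \tilde{c}_{j,k_{j,a}}$ be the normalized expected reward of virtual action $k_{j,a}$. For $a = 1$, using $\frac{u_{j,k_{j,1}}}{c_{j,k_{j,1}}} \ge \frac{u_{j,k_{j,2}}}{c_{j,k_{j,2}}}$, we can show that $\tilde{\eta}_{j,k_{j,1}} \ge \tilde{\eta}_{j,k_{j,2}}$. For $2 \le a \le K_j - 1$, using $\frac{u_{j,k_{j,a}} - u_{j,k_{j,a-1}}}{c_{j,k_{j,a}} - c_{j,k_{j,a-1}}} \ge \frac{u_{j,k_{j,a+1}} - u_{j,k_{j,a-1}}}{c_{j,k_{j,a+1}} - c_{j,k_{j,a-1}}}$, we can show that $\tilde{\eta}_{j,k_{j,a}} \ge \tilde{\eta}_{j,k_{j,a+1}}$. In other words, we can verify that $\tilde{\eta}_{j,k_{j,1}} \ge \tilde{\eta}_{j,k_{j,2}} \ge \cdots \ge \tilde{\eta}_{j,k_{j,K_j}}$.

Thus, without constraint (61), the optimal solution $\tilde{p}_j^* = [\tilde{p}^*_{j,k_{j,1}}, \tilde{p}^*_{j,k_{j,2}}, \ldots, \tilde{p}^*_{j,k_{j,K_j}}]$ automatically satisfies $\tilde{p}^*_{j,k_{j,1}} \ge \tilde{p}^*_{j,k_{j,2}} \ge \cdots \ge \tilde{p}^*_{j,k_{j,K_j}}$. Hence, we can remove the constraint (61), and thus decouple the probability constraint under a context.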
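The virtual-action transformation is easy to check on a small instance. The sketch below assumes its inputs are the costs and rewards of one context's candidate set, already sorted in descending order of normalized reward; the helper name `virtual_actions` is hypothetical.

```python
def virtual_actions(costs, rewards):
    """Incremental (virtual) costs and rewards of a sorted candidate set.

    Returns (c_tilde, u_tilde, eta_tilde), where eta_tilde[a] is the
    normalized reward u_tilde[a] / c_tilde[a] of the a-th virtual action.
    """
    c_tilde, u_tilde = [], []
    prev_c, prev_u = 0.0, 0.0  # dummy action with zero cost and zero reward
    for c, u in zip(costs, rewards):
        c_tilde.append(c - prev_c)
        u_tilde.append(u - prev_u)
        prev_c, prev_u = c, u
    eta_tilde = [u / c for u, c in zip(u_tilde, c_tilde)]
    return c_tilde, u_tilde, eta_tilde


if __name__ == "__main__":
    # Candidate set with costs (1, 4) and rewards (0.5, 1.0).
    c_t, u_t, eta_t = virtual_actions([1.0, 4.0], [0.5, 1.0])
    print(c_t, u_t, eta_t)
    # The virtual normalized rewards should be non-increasing.
    print(all(x >= y for x, y in zip(eta_t, eta_t[1:])))
```

For a candidate set produced by the pruning step, the last line is expected to print `True`; an increase in `eta_tilde` would indicate that some action should have been removed.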
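With the above transformations, we can thus rewrite the global LP problem as

$$(\widetilde{\mathcal{LP}}'_{T,B}) \quad \text{maximize} \quad \sum_{j=1}^{J} \sum_{a=1}^{K_j} \pi_j \tilde{p}_{j,k_{j,a}} \tilde{u}_{j,k_{j,a}},$$

$$\text{subject to} \quad \sum_{j=1}^{J} \sum_{a=1}^{K_j} \pi_j \tilde{p}_{j,k_{j,a}} \tilde{c}_{j,k_{j,a}} \le B/T,$$

$$\tilde{p}_{j,k_{j,a}} \in [0, 1], \ \forall j, \text{ and } 1 \le a \le K_j.$$

The solution of $\widetilde{\mathcal{LP}}'_{T,B}$ follows a threshold structure. We sort all context-(virtual-)action pairs $(j, k_{j,a})$ in descending order of their normalized expected reward. Let $j^{(i)}$ and $k^{(i)}$ be the context index and action index of the $i$-th pair, respectively. Namely, $\tilde{\eta}_{j^{(1)},k^{(1)}} \ge \tilde{\eta}_{j^{(2)},k^{(2)}} \ge \cdots \ge \tilde{\eta}_{j^{(M)},k^{(M)}}$, where $M = \sum_{j=1}^{J} K_j$ is the total number of candidate actions for all contexts. Define a threshold corresponding to $\rho = B/T$,

$$\tilde{i}(\rho) = \max\Big\{ i : \sum_{i'=1}^{i} \pi_{j^{(i')}} \tilde{c}_{j^{(i')},k^{(i')}} \le \rho \Big\}, \tag{62}$$

where $\rho = B/T$ is the average budget. We can verify that the following solution is optimal for $\widetilde{\mathcal{LP}}'_{T,B}$:

$$\tilde{p}_{j^{(i)},k^{(i)}}(\rho) = \begin{cases} 1, & \text{if } 1 \le i \le \tilde{i}(\rho), \\ \dfrac{\rho - \sum_{i'=1}^{\tilde{i}(\rho)} \pi_{j^{(i')}} \tilde{c}_{j^{(i')},k^{(i')}}}{\pi_{j^{(\tilde{i}(\rho)+1)}} \tilde{c}_{j^{(\tilde{i}(\rho)+1)},k^{(\tilde{i}(\rho)+1)}}}, & \text{if } i = \tilde{i}(\rho) + 1, \\ 0, & \text{if } i > \tilde{i}(\rho) + 1. \end{cases}$$

Then, the optimal solution of $\mathcal{LP}'_{T,B}$ can be calculated using the reverse transformation from the $\tilde{p}_{j,k}(\rho)$'s to the $p_{j,k}(\rho)$'s.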
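The threshold structure amounts to a greedy fill in descending order of virtual normalized reward. A minimal sketch is given below, assuming each pair is supplied as a tuple of context probability, virtual cost, and virtual reward; the function names `threshold_solution` and `reverse_transform` are hypothetical.

```python
def threshold_solution(pairs, rho):
    """Greedy threshold solution of the transformed LP.

    pairs: list of (pi_j, c_tilde, u_tilde), one per context/virtual action.
    rho:   average budget B/T.
    Returns p_tilde in the same order as `pairs`.
    """
    order = sorted(
        range(len(pairs)),
        key=lambda i: pairs[i][2] / pairs[i][1],
        reverse=True,
    )
    p_tilde = [0.0] * len(pairs)
    remaining = rho
    for i in order:
        pi_j, c, _ = pairs[i]
        need = pi_j * c  # average budget used when this pair has probability 1
        if need <= remaining:
            p_tilde[i] = 1.0
            remaining -= need
        else:
            p_tilde[i] = remaining / need  # fractional pair at the threshold
            break
    return p_tilde


def reverse_transform(p_tilde_ctx):
    """Map one context's p_tilde (in candidate order) back to p."""
    shifted = p_tilde_ctx[1:] + [0.0]
    return [p - q for p, q in zip(p_tilde_ctx, shifted)]


if __name__ == "__main__":
    # Context 1 (pi = 0.4) has two virtual actions; context 2 (pi = 0.6) has one.
    pairs = [(0.4, 1.0, 0.5), (0.4, 3.0, 0.5), (0.6, 2.0, 0.8)]
    print(threshold_solution(pairs, 1.0))
    print(reverse_transform([1.0, 0.25]))
```

In the example, the first pair is taken with probability 1, the third pair absorbs the remaining budget fractionally, and the second pair gets probability 0, so the average budget constraint holds with equality.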


F.1.2 ALP Algorithm 

Similar to unit-cost systems, the ALP algorithm replaces the average constraint $B/T$ in $\mathcal{LP}'_{T,B}$ with the average remaining budget $b_\tau/\tau$, and obtains the probabilities $p_{j,k}(b_\tau/\tau)$. Under context $j$, the ALP algorithm takes action $k$ with probability $p_{j,k}(b_\tau/\tau)$.

Unlike unit-cost systems, the remaining budget $b_\tau$ does not follow any classic distribution in heterogeneous-cost systems. However, we can show that the concentration property still holds for this general case by using the method of averaged bounded differences [23].

Lemma 12. For $0 < \delta < 1$, there exists a positive number $\kappa$, such that under the ALP algorithm, the remaining budget $b_\tau$ satisfies

¥{by > {p + 5)t} < 

V{by < {p-S)t} < 


Proof. We prove the lemma using the method of averaged bounded differences [23]. The process is similar to Section 7.1 in [23], except that we consider the remaining budget, and the successive differences of the remaining budget are bounded by $c_{\max}$.

Specifically, let $c_{t'}$, $1 \le t' \le T$, be the budget consumed under ALP in round $t'$, and let $\mathbf{c}_t = (c_1, c_2, \ldots, c_t)$. Then the remaining budget at round $t$ (with remaining time $\tau = T - t + 1$), i.e., $b_{T-t+1}$, is a function of $\mathbf{c}_t$. We note that under ALP, the expectation of the ratio between the remaining budget and the
remaining time does not change, i.e., for any b < J2j=i (here c* = max^ Cj^k), if by = b, then 
E[6 t— i/(t — 1)] = b/r. Thus, we can verify that for any 1 < L < f, we have 

E[6T-t-i-i|ct'] = br-g+i - y ^ 
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Note that Ab = bT-t '+2 - br-t'+i < Cmax and &T-i '+2 > -Cmax, we have 

|E[6T-t+i|ct'] — E[6'r-t+i|ct'_i] I 
&T-t'+2 


Moreover, 


< 


< 


max 

0<Ab<c„ax 


A&- 


T -t' + 2' \ T -t' + 1 


T-t + 1 


‘2Cnis^x(T — t + 1) 

T-t' + 1 


E 

t'^1 


r2Cniax(r — ^ + 1)-|2 

^ T-t' + 1 ^ 




= - (+ 1 )" 


T'=T-t + l 


(rO^ 


4cLx(^-^ + l)^ / 

JT-t+l ) 


— 4c^ax(^~^+l) 


T-t+1 

t- 1 


r 


According to Theorem 5.3 in [23], and noting $\tau = T - t + 1$, $\mathbb{E}[b_\tau] = \rho\tau$, we have

_ 2T{<5p'r)2 _ _ s'^ p'^ 

V > E[6^] +St} <e '“=ma,x(^-‘ + l)(‘-l) < e + < g 5^ 


and similarly, 


P{5t- < E[ 6 ,-] — (5r} < e 


5±p± 


Choosing k = — concludes the proof. 


( 64 ) 


(65) 

( 66 ) 

(67) 

□ 


Then, using methods similar to those for unit-cost systems, we can show that the generalized ALP algorithm achieves $O(1)$ regret in non-boundary cases, and $O(\sqrt{T})$ regret in boundary cases, where the boundaries are now defined as $Q_i = \sum_{i'=1}^{i} \pi_{j^{(i')}} \tilde{c}_{j^{(i')},k^{(i')}}$.


F.2 ε-First ALP Algorithm

When the expected rewards are unknown, it is difficult to combine the UCB method with the proposed ALP for general systems. As a special case, when all actions have the same cost under a given context, i.e., $c_{j,k} = c_j$ for all $k$ and $j$, the normalized expected reward $\eta_{j,k}$ represents the quality of action $k$ under context $j$. In this case, the candidate set for each context contains only one action, namely the action with the highest expected reward. Thus, the ALP algorithm for the known-statistics case is simple. When the expected rewards are unknown, we can extend the UCB-ALP algorithm by maintaining the UCB for the normalized expected rewards.

When the costs for different actions under the same context are heterogeneous, it is difficult to combine ALP with the UCB method, since the ALP algorithm in this case requires not only the ordering of the $\eta_{j,k}$'s, but also the ordering of the $u_{j,k}$'s and of the ratios $\frac{u_{j,k_1} - u_{j,k_2}}{c_{j,k_1} - c_{j,k_2}}$. We therefore propose an ε-First ALP algorithm that explores and exploits separately: the agent takes actions under all contexts in the first $\epsilon(T)$ rounds to estimate the expected rewards, and runs ALP based on the estimates in the remaining $T - \epsilon(T)$ rounds.

For the ease of exposition, we assume Cj^ki ^ Cj,k 2 for any j and fci 7 ^ ^ 2 and let be the 
minimal difference, i.e., 

^min= min UCjM-CjmI}- 

kiMeioyuA 

Footnote: For the case with $c_{j,k_1} = c_{j,k_2}$ for some $j$ and $k_1 \ne k_2$ (and $u_{j,k_1} \ne u_{j,k_2}$), we can correctly remove the suboptimal action with high probability by comparing their empirical rewards $\bar{u}_{j,k_1}$ and $\bar{u}_{j,k_2}$.
Algorithm 5 ε-First ALP

Input: Time horizon $T$, budget $B$, exploration stage length $\epsilon(T)$, and $c_{j,k}$'s, for all $j$ and $k$;
Init: Remaining budget $b = B$;
    $C_{j,k} = 0$, $\bar{u}_{j,k} = 0$, $\forall j \in \mathcal{X}$, $k \in \mathcal{A}$;
for t = 1 to $\epsilon(T)$ do
    if b > 0 then
        Take action $A_t = \arg\min_{k \in \mathcal{A}} C_{X_t,k}$ (with random tie-breaking);
        Observe the reward $Y_{A_t,t}$;
        Update counter $C_{X_t,A_t} = C_{X_t,A_t} + 1$; update remaining budget $b = b - c_{X_t,A_t}$;
        Update the reward estimate:
            $\bar{u}_{X_t,A_t} = \frac{(C_{X_t,A_t} - 1)\,\bar{u}_{X_t,A_t} + Y_{A_t,t}}{C_{X_t,A_t}}$;
    end if
end for
for t = $\epsilon(T)$ + 1 to T do
    Remaining time $\tau = T - t + 1$;
    if b > 0 then
        Obtain the probabilities $p_{j,k}(b/\tau)$'s by solving the problem $(\mathcal{LP}'_{\tau,b})$ with $u_{j,k}$ replaced by $\bar{u}_{j,k}$;
        Take action $k$ with probability $p_{X_t,k}(b/\tau)$;
        Update remaining budget $b = b - c_{X_t,A_t}$;
    end if
end for
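The exploration stage of Algorithm 5 can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function `explore`, the reward callback `sample_reward`, and the simulated context arrivals via `pi` are all hypothetical names introduced here. The running-mean update is algebraically the same as the estimate update in the algorithm.

```python
import random


def explore(T_explore, budget, costs, pi, sample_reward, rng):
    """Exploration stage of an epsilon-First scheme (sketch).

    costs[j][k]: known cost of action k under context j.
    pi[j]: context probabilities, used only to simulate arrivals.
    sample_reward(j, k, rng): hypothetical callback returning a reward.
    Returns (counts, estimates, remaining budget).
    """
    J, K = len(costs), len(costs[0])
    counts = [[0] * K for _ in range(J)]
    est = [[0.0] * K for _ in range(J)]
    b = budget
    for _ in range(T_explore):
        if b <= 0:
            break  # no budget left, so no further action can be taken
        j = rng.choices(range(J), weights=pi)[0]  # simulated context arrival
        # Least-played action under this context, random tie-breaking.
        fewest = min(counts[j])
        k = rng.choice([a for a in range(K) if counts[j][a] == fewest])
        y = sample_reward(j, k, rng)
        counts[j][k] += 1
        b -= costs[j][k]
        est[j][k] += (y - est[j][k]) / counts[j][k]  # running mean
    return counts, est, b


if __name__ == "__main__":
    rng = random.Random(0)
    means = [[0.2, 0.5], [0.3, 0.9]]  # illustrative Bernoulli means
    bernoulli = lambda j, k, r: 1.0 if r.random() < means[j][k] else 0.0
    counts, est, b = explore(200, 1000, [[1, 2], [1, 3]], [0.5, 0.5], bernoulli, rng)
    print(counts, b)
```

Because the least-played action is always chosen, the execution counts of the actions under any one context differ by at most one, which is the balancing property the proof of Lemma 13 relies on.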


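Let $\xi_{j,k_1,k_2} = \frac{u_{j,k_1} - u_{j,k_2}}{c_{j,k_1} - c_{j,k_2}}$ for $j \in \mathcal{X}$, $k_1, k_2 \in \{0\} \cup \mathcal{A}$, and $k_1 \ne k_2$ (recall that $u_{j,0} = 0$ and $c_{j,0} = 0$ for the dummy action), and let $\bar{\xi}_{j,k_1,k_2}$ be its estimate at the end of the exploration stage, i.e.,

$$\bar{\xi}_{j,k_1,k_2} = \frac{\bar{u}_{j,k_1} - \bar{u}_{j,k_2}}{c_{j,k_1} - c_{j,k_2}}$$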


. Let be the minimal difference between any Cii,feii,fci 2 ^ 


.j2,k21,k22^ 


= 


min {\^j 

3l ,32^^ 

fell 5^12 5 ^ 211^22 ^{ 0 } U -4 


02 5 ^ 215^22 


|}- 


Moreover, let TTmin = mAij^x Xj and let A* = Then, the following lemma states that 

under ε-First ALP with a sufficiently large $\epsilon(T)$, the agent will obtain a correct ordering of the $\xi_{j,k_1,k_2}$'s with high probability at the end of the exploration stage.

Lemma 13. Let 0 < <5 < 1. Under e-First ALP, if 


e(T) = 


K , ^ .1 16iL ' 

(1 - ^ og (l-<5)7r^i„(A*)2} 


then for any contexts ji, j 2 € X, and actions kn, ki 2 , k 2 i, k 22 € {0}UA if^ji,kii,ki 2 < ?j 2 .fc 2 i,fc 22 > 
then at the end of the e{T)-th round, we have 



A Cj2,fe21,fe22 


< {J + A)T-^. 


Moreover, the agent ranks all the ^jki,k 2 ’s correctly with probability no less than l — {‘iK+l)JT 


Proof. We first analyze the number of executions for each context-action pair $(j, k)$ in the exploration stage. Let $N_j = \sum_{t=1}^{\epsilon(T)} \mathbb{1}(X_t = j)$ be the number of occurrences of context $j$ up to round $\epsilon(T)$. Recall that the contexts $X_t$ arrive i.i.d. in each round. Thus, using the Hoeffding-Chernoff bound for each context $j$, we have


Vj G X,Nj > (1 - 5)TTje{T) 


>l-^p|iV, <(l-5)7r,e(r)| 
.7 = 1 ^ 


> 1 _ 

> 1 -Je- 2 i°g'r 

= 1 - jt -2 


( 68 ) 


On the other hand, the lower bound (1 — 5)'Kje{T) > K -\- From the implementation of 

the exploration stage in Algorithm]^ we know that if Nj > (1 — 5)'Kje(T), then 


Therefore, 


^ li , 16 logT I ^ leiogT ^ ^ 

> [1 + (^*^2 J - (^*)2 > ^ 


Vj e x,yk e A,Cj^k > ^^*^2 I 
> 1 - JT -2 


(69) 


(70) 


Next, we study the relationship between the estimates $\bar{\xi}_{j_1,k_{11},k_{12}}$ and $\bar{\xi}_{j_2,k_{21},k_{22}}$ at the end of the exploration stage. We note that


Cil,fcll,fcl2 — ^j2,k21,k22 


^ {ijl,kii,ki2 ijl,kii,ki2 

( 02 ,^ 21.^22 02 ,^ 21,^22 V 


02 .^ 21,^22 Ol.fcll.fcl 


02 ,^ 21.^22 Ol.fcll.fel 


-)>0 




I%l,fcll %l,fcll ^j2,k21,k22 0l,fell,fel2 


-( 
-( 
+ ( 


k^jl,kii Cjl,kl2 4 

^jl,kl2 ~ '^jl,kl2 I ij2,k21,k22 ~ Ol.fell.fcl 


''il.fell ^jl,k\2 4 

0'2,fe21 ~ '^j2,k21 I ij2,k21,k22 ~ Ol.^ll.^l 


*'^ 2.^21 ~ ^32,k22 4 

0'2,fc22 ~ '^j2,k22 _ ij2,k21,k22 ~ Ol.fcll.fcl 

^32,k2l ~ ^jl,k22 ^ 


■)> 0 . 


(71) 


Thus, for the event $\bar{\xi}_{j_1,k_{11},k_{12}} \ge \bar{\xi}_{j_2,k_{21},k_{22}}$ to be true, we require that at least one term (with the sign) in the last inequality above is no less than zero. Conditioned on $C_{j,k} \ge \frac{16 \log T}{(\Delta^*)^2}$, we can bound the probability of each term according to the Hoeffding-Chernoff bound. For example, for the first term, we have


'^jl,kll ttjl.fcll ^j2,k2l,k22 Ol.fcll.fcl 


t-fl.fcl 

|Cji,fcii > 


t'jl,fcl2 

16 log T 


> 0 


{A*y 


} 


— ^{t^ii.fcii — V A lOt'i.fcii — !/\*\2 } 


(A*)2 


< e 


-21ogT 


= T- 


The conclusion then follows by considering the event $\left\{ C_{j,k} \ge \frac{16 \log T}{(\Delta^*)^2},\ \forall j \in \mathcal{X}, \forall k \in \mathcal{A} \right\}$ and its negation. □
Theorem 5. Let $0 < \delta < 1$. Under ε-First ALP, if

/ N K , r 1 IQK 

e[T) > - -^7-hlogTmax-^ — 

(l-OjTTmin I (1 “ <5)7rmi„(A*)"^ 

then the regret of ε-First ALP satisfies:

1) if $\rho = B/T \ne Q_i$, then $R_{\epsilon\text{-First ALP}}(T, B) = O(\log T)$;

2) if $\rho = B/T = Q_i$, then $R_{\epsilon\text{-First ALP}}(T, B) = O(\sqrt{T})$.


Proof. (Sketch) The key idea of the proof is to consider the event where the $\xi_{j,k_1,k_2}$'s are ranked correctly and its negation. When the $\xi_{j,k_1,k_2}$'s are ranked correctly, we can use the properties of the ALP algorithm with a modification of the time horizon and budget (subtracting the time and budget spent in the exploration stage, which is $O(\log T)$); otherwise, if the agent obtains wrong ranking
results, the regret is bounded as 0(1) because the probability is 0(T“^) and the reward in each 
round is bounded. 


F.3 Deciding ε(T) without Prior Information

In Theoremj^ the agent requires the value of A* (in fact A^j^^ because is known) to calculate 
e(r). This IS usually impractical since the expected rewards are unknown a priori. Thus, without 

the knowledge of we propose a Confidence Level Test (CLT) algorithm for deciding when to 
end the exploration stage. 

Specifically, assume > 0 and is unknown by the agent. In each round of the exploration stage, 
the agent tries to solve the problem $(\mathcal{LP}'_{\tau,b})$ with $u_{j,k}$ replaced by $\bar{u}_{j,k}$ using comparisons, i.e., using Algorithm 4 and sorting the virtual actions. For each comparison, the agent tests the confidence level according to Algorithm 6. If all comparisons pass the test, i.e., flagSucc = true for all comparisons, then the agent ends the exploration stage and starts the exploitation stage.


Algorithm 6 Confidence Level Test (CLT) 

Input: Time horizon T, estimates Cii,fcii.fci 2 ’ number of executions 

^j2,k21’ ^j2,k22’ 

Output: f lagSucc; 

Init: flagSucc = false, 

/\f _ ^niin(^jl ~^j2 >^21^^22 ) . 

^ ~ 2 

if g-2(A')2min{Cji,fcjj,Cj^,fcj2} < “'"^‘^ 32 A21 ■C'j2A22> < then 

flagSucc = true; 

end if 

return flagSucc; 


Next, we show that the ε-First policy with CLT achieves $O(\log T)$ regret except for the boundary cases, where it achieves $O(\sqrt{T})$ regret. On one hand, according to the Hoeffding-Chernoff bound, if all
comparisons pass the confidence level test, then with probability at least 1 — the algorithm 

obtains the correct ranking and provides a correct solution for the problem $(\mathcal{LP}'_{\tau,b})$. On the other hand,
because A* > 0, from the analysis in the previous section, we know that the exploration stage will 
end within 0(log T) rounds with high probability. Therefore, the expected regret is the same as that 

in the case with known a|^1 

G Numerical Experiments 

In this section, we evaluate the regret of the proposed algorithms through numerical simulations. We 
study the performance of the proposed algorithms here for unit-cost systems as the parameter setting 
is relatively simple to control while providing us useful insights. The performance in heterogeneous- 
cost systems is similar as we have shown theoretically, and omitted here. In the case with known 
statistics, we compare the proposed PB (two-context case) and ALP algorithms with Fixed LP (FLP) 
algorithm that uses a fixed average budget constraint B/T since both ifTTl and EOll use fixed aver¬ 
age budget constraint. Then, the UCB-based FLP, i.e., UCB-FLP, is evaluated in the case without 
knowledge of expected rewards. We also evaluate algorithms for the case without knowledge of the context distribution. When the context distribution is unknown to the agent, we use the Empirical ALP (EALP) algorithm, which uses the empirical distribution (histogram) of contexts for making decisions, in the case with known expected rewards. Then, the UCB-based EALP is proposed for the case without knowledge of expected rewards. The results are averaged over 5,000 independent runs of the simulations.

G.1 Two-Context Systems

[Figure 1, panels (a), (b), (c); horizontal axis: Horizon T]

Figure 1: Comparison of algorithms for the two-context systems with perfect knowledge ($\pi_1 = 0.4$, $\pi_2 = 0.6$), (a) $\rho = 0.39$, (b) $\rho = 0.4$, (c) $\rho = 0.41$.

We first consider a two-context scenario with $K = 3$ arms and Bernoulli rewards; the context distribution vector is $\pi = [0.4, 0.6]$, and the expected rewards are $u_1 = 0.8 \times [1/3, 2/3, 1]$ for context 1 and $u_2 = 0.4 \times [1/3, 2/3, 1]$ for context 2. The boundary is $q_1 = \pi_1 = 0.4$, and we study the cases with normalized budget $\rho = 0.39$, $0.4$, and $0.41$, respectively.
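For readers who want to reproduce the flavor of this setup, the following is a minimal simulation sketch for the known-statistics, unit-cost case. It is not the authors' experiment code: all function names are hypothetical, and the agent is assumed to always play the best arm of each context (expected rewards 0.8 and 0.4), so only the per-context execution probability differs between the adaptive (ALP) and fixed (FLP) rules.

```python
import random

PI = [0.4, 0.6]      # context distribution from the text
U_BEST = [0.8, 0.4]  # best expected reward per context (0.8 * 1 and 0.4 * 1)


def lp_probs(rho):
    """Threshold solution for two contexts sorted by descending best reward."""
    p1 = min(1.0, max(0.0, rho / PI[0]))
    p2 = min(1.0, max(0.0, (rho - PI[0]) / PI[1]))
    return [p1, p2]


def run(T, B, adaptive, rng):
    """One simulated run; adaptive=True uses b/tau (ALP), else fixed B/T (FLP)."""
    b, total = B, 0.0
    for t in range(1, T + 1):
        if b <= 0:
            break
        tau = T - t + 1  # remaining time
        rho = b / tau if adaptive else B / T
        j = 0 if rng.random() < PI[0] else 1  # context arrival
        if rng.random() < lp_probs(rho)[j]:
            b -= 1  # unit cost
            total += 1.0 if rng.random() < U_BEST[j] else 0.0  # Bernoulli reward
    return total, b


def upper_bound(T, B):
    """T * v(rho) for this two-context instance."""
    p = lp_probs(B / T)
    return T * sum(PI[j] * p[j] * U_BEST[j] for j in range(2))


if __name__ == "__main__":
    rng = random.Random(1)
    T, B, runs = 2000, 780, 200  # rho = 0.39, a non-boundary case
    for name, adaptive in [("ALP", True), ("FLP", False)]:
        avg = sum(run(T, B, adaptive, rng)[0] for _ in range(runs)) / runs
        print(name, "average regret vs. upper bound:", upper_bound(T, B) - avg)
```

The printed values are Monte Carlo estimates and will vary with the seed and the number of runs; the regret here is measured against the upper bound $T v(\rho)$ rather than the oracle.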

Figure 1 shows the regret of different algorithms in the case with known expected rewards. In the non-boundary cases (i.e., $\rho = 0.39, 0.41$), the ALP algorithm achieves near-optimal performance. Even without knowledge of the context distribution, the EALP algorithm performs much better than FLP. In the boundary case, i.e., $\rho = 0.4$, the regret of ALP increases with $T$ but is still lower than that of FLP. The EALP algorithm achieves higher regret than ALP and FLP due to the empirical distribution errors.



Figure 2: Comparison of algorithms for the two-context systems without perfect knowledge ($\pi_1 = 0.4$, $\pi_2 = 0.6$), (a) $\rho = 0.39$, (b) $\rho = 0.4$, (c) $\rho = 0.41$.

Figure 2 shows the regret of different algorithms in the case without knowledge of expected rewards. We can see that in the non-boundary cases, UCB-ALP and UCB-EALP achieve regret that is very close to that of UCB-PB and outperform UCB-FLP. Interestingly, UCB-ALP even achieves slightly lower regret than UCB-PB in the case with $\rho = 0.41$. This is because under UCB-PB, the better context may be skipped and wasted if it does not have the highest UCB. In contrast, the UCB-ALP algorithm may allocate certain resources to the better context, even when it does not have the highest UCB. In the boundary case, the regrets of UCB-ALP and UCB-EALP become larger than that of UCB-PB, but are still sublinear in $T$.

G.2 Multi-Context Systems 

Next, we study a multi-context scenario with $J = 10$ contexts, $K = 5$ arms, and Bernoulli rewards. Specifically, the context distribution vector is $\pi = [0.025, 0.05, 0.075, 0.15, 0.2, 0.2, 0.15, 0.075, 0.05, 0.025]$. The expected reward of action $k$
under context j is Uj^k = One boundary in this system is = 0.5. We study the cases with 

average budget $\rho = 0.49$, $0.5$, and $0.51$, respectively. In this case, it is difficult to calculate the expected total reward obtained by the oracle solution. Thus, we calculate the regret by comparing with the upper bound, i.e., $U(T, B) = T v(\rho)$.
Figure 3: Comparison of algorithms for the multi-context systems with perfect knowledge ($q_5 = 0.5$), (a) $\rho = 0.49$, (b) $\rho = 0.5$, (c) $\rho = 0.51$.

Figure 3 shows the regret of different algorithms in the case with known expected rewards. In the non-boundary cases, both the ALP and EALP algorithms achieve performance similar to that in the two-context case. The regret of EALP is even lower than that of FLP in the boundary case, since the ratio of contexts that are executed with correct probability is higher than that in the two-context systems.
Figure 4: Comparison of algorithms for the multi-context systems without perfect knowledge ($q_5 = 0.5$), (a) $\rho = 0.49$, (b) $\rho = 0.5$, (c) $\rho = 0.51$.

Figure 4 shows the regret of different algorithms in the case without knowledge of expected rewards. We can see that all algorithms achieve sublinear regret, but the difference between the non-boundary cases and the boundary case is small. This is rooted in the fact that when the number of contexts and the number of actions are large, it requires more time to learn the expected rewards. Hence, the constant in the $\log T$ term is much larger than that in the $\sqrt{T}$ term, so the $\log T$ term dominates the regret and the impact of the $\sqrt{T}$ term could be small. Exploring the structure of the reward function
in contextual bandits, e.g., similarity m and linearity IS), to reduce the exploration time is part of 
our future work. 